Linux device drivers Notes

General
misc-modules
- hello.c
- hellop.c
- complete.c
- faulty.c
- jiq.c
- kdataalign.c
- kdatasize.c
- sleepy.c
- jit.c
- seq.c
- silly.c
misc-progs
- dataalign.c
- datasize.c
- asynctest.c
- gdbline
- inp.c
- load50.c
- mapcmp.c
- mapper.c
- nbtest.c
- netifdebug.c
- outp.c
- polltest.c
- setconsole.c
- setlevel.c
skull
scull
short
scullc
sculld
scullp
scullv
simple
shortprint
pci
usb
lddbus
sbull
snull
tty

General

Make

KERNELDIR ?= /lib/modules/$(shell uname -r)/build
#  -r, --kernel-release         print the kernel release
$(MAKE) -C $(KERNELDIR) M=$(PWD) modules

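KERNELRELEASE is a variable defined in the top-level Makefile of the kernel source tree. The module Makefile described here has the usual ifneq ($(KERNELRELEASE),) ... else ... endif layout. The first time make reads this Makefile, KERNELRELEASE is not yet defined, so make executes the part after else. If the target is clean, make runs the clean rule and stops. If the target is all, -C $(KERNELDIR) tells make to change to the kernel source directory and read the Makefile there, and M=$(PWD) tells it to come back to the current directory afterwards and read this Makefile again. On that second pass KERNELRELEASE is defined and kbuild is running, so make reads the part before else. That part consists of kbuild statements that name the object files making up the module and the module to be built: mymodule-objs := file1.o file2.o says that mymodule.o is linked from file1.o and file2.o, and obj-m := mymodule.o says that mymodule.o is to be built as a loadable module (producing mymodule.ko).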

kbuild

# Use make M=dir to specify directory of external module to build 
# Old syntax make ... SUBDIRS=$PWD is still supported 
# Setting the environment variable KBUILD_EXTMOD take precedence 
ifdef SUBDIRS 
KBUILD_EXTMOD ?= $(SUBDIRS) 
endif 
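# If M was never defined or assigned, it is undefined here and this block is skipped
ifdef M 
# If M is defined, check whether it came from the command line
ifeq ("$(origin M)", "command line") 
KBUILD_EXTMOD := $(M) 
endif
endif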

Generated Files

|-- modules.order
|-- Module.symvers
|-- XXX.ko
|-- XXX.mod.c
|-- XXX.mod.o
|-- XXX.o

Some important Data Structures

File Operation

The file_operations structure, defined in <linux/fs.h>, is a collection of function pointers. Each open file (represented internally by a file structure, which we will examine shortly) is associated with its own set of functions (by including a field called f_op that points to a file_operations structure).

struct module *owner

The first file_operations field is not an operation at all; it is a pointer to the module that "owns" the structure. This field is used to prevent the module from being unloaded while its operations are in use. Almost all the time, it is simply initialized to THIS_MODULE, a macro defined in <linux/module.h>.

loff_t (*llseek) (struct file *, loff_t, int);

The llseek method is used to change the current read/write position in a file, and the new position is returned as a (positive) return value. The loff_t parameter is a "long offset" and is at least 64 bits wide even on 32-bit platforms. Errors are signaled by a negative return value. If this function pointer is NULL, seek calls will modify the position counter in the file structure (described in Section 3.3.2) in potentially unpredictable ways.

ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);

Used to retrieve data from the device. A null pointer in this position causes the read system call to fail with -EINVAL ("Invalid argument"). A nonnegative return value represents the number of bytes successfully read (the return value is a "signed size" type, usually the native integer type for the target platform).

ssize_t (*aio_read)(struct kiocb *, char __user *, size_t, loff_t);

Initiates an asynchronous read—a read operation that might not complete before the function returns. If this method is NULL, all operations will be processed (synchronously) by read instead.

ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);

Sends data to the device. If NULL, -EINVAL is returned to the program calling the write system call. The return value, if nonnegative, represents the number of bytes successfully written.

ssize_t (*aio_write)(struct kiocb *, const char __user *, size_t, loff_t *);

Initiates an asynchronous write operation on the device.

int (*readdir) (struct file *, void *, filldir_t);

This field should be NULL for device files; it is used for reading directories and is useful only for filesystems.

unsigned int (*poll) (struct file *, struct poll_table_struct *);

The poll method is the back end of three system calls: poll, epoll, and select, all of which are used to query whether a read or write to one or more file descriptors would block. The poll method should return a bit mask indicating whether non-blocking reads or writes are possible, and, possibly, provide the kernel with information that can be used to put the calling process to sleep until I/O becomes possible. If a driver leaves its poll method NULL, the device is assumed to be both readable and writable without blocking.
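
Seen from user space, the driver's poll method is what answers a poll() call on its descriptor. The following is a minimal user-space sketch, not driver code: it assumes a POSIX system and uses a pipe as a stand-in for a device. The helper names ready_events and pipe_becomes_readable are made up for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Ask the kernel, without blocking (timeout 0), which of the `wanted`
 * events are ready on fd. Returns the revents mask, 0 if none, -1 on error. */
int ready_events(int fd, short wanted)
{
    struct pollfd pfd = { .fd = fd, .events = wanted, .revents = 0 };
    int n;

    assert(fd >= 0);
    n = poll(&pfd, 1, 0);
    if (n < 0)
        return -1;
    return n == 0 ? 0 : pfd.revents;
}

/* Hypothetical demo helper: returns 1 if the read end of a fresh pipe is
 * reported as not readable while empty and readable after one byte is
 * written, 0 if not, -1 on setup failure. */
int pipe_becomes_readable(void)
{
    int fds[2], before, after;

    if (pipe(fds) != 0)
        return -1;
    before = ready_events(fds[0], POLLIN);
    if (write(fds[1], "x", 1) != 1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    after = ready_events(fds[0], POLLIN);
    close(fds[0]);
    close(fds[1]);
    return before == 0 && after > 0 && (after & POLLIN) != 0;
}
```

The bit mask that comes back in revents is the user-visible form of the mask a poll method returns.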

int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);

The ioctl system call offers a way to issue device-specific commands (such as formatting a track of a floppy disk, which is neither reading nor writing). Additionally, a few ioctl commands are recognized by the kernel without referring to the fops table. If the device doesn't provide an ioctl method, the system call returns an error for any request that isn't predefined (-ENOTTY, "No such ioctl for device").

int (*mmap) (struct file *, struct vm_area_struct *);

mmap is used to request a mapping of device memory to a process's address space. If this method is NULL, the mmap system call returns -ENODEV.

int (*open) (struct inode *, struct file *);

Though this is always the first operation performed on the device file, the driver is not required to declare a corresponding method. If this entry is NULL, opening the device always succeeds, but your driver isn't notified.

int (*flush) (struct file *);

The flush operation is invoked when a process closes its copy of a file descriptor for a device; it should execute (and wait for) any outstanding operations on the device. This must not be confused with the fsync operation requested by user programs. Currently, flush is used in very few drivers; the SCSI tape driver uses it, for example, to ensure that all data written makes it to the tape before the device is closed. If flush is NULL, the kernel simply ignores the user application request.

int (*release) (struct inode *, struct file *);

This operation is invoked when the file structure is being released. Like open, release can be NULL.

Note that release isn't invoked every time a process calls close. Whenever a file structure is shared (for example, after a fork or a dup), release won't be invoked until all copies are closed. If you need to flush pending data when any copy is closed, you should implement the flush method.
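
The sharing described above can be observed from user space. This is a small sketch under the assumption of a POSIX system; it uses a temporary file rather than a device, and dup_shares_offset is a name invented here. Because both descriptors refer to one file structure, they share one file position, and only the last close would lead to release in a driver.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Returns 1 if a dup()ed descriptor shares its file position with the
 * original, 0 if it does not, -1 on setup failure. */
int dup_shares_offset(void)
{
    FILE *tmp = tmpfile();          /* anonymous temporary file */
    int fd, copy;
    off_t pos;

    if (tmp == NULL)
        return -1;
    fd = fileno(tmp);
    assert(fd >= 0);
    copy = dup(fd);                 /* second descriptor, same file structure */
    if (copy < 0) {
        fclose(tmp);
        return -1;
    }
    if (write(fd, "abcd", 4) != 4) {
        close(copy);
        fclose(tmp);
        return -1;
    }
    pos = lseek(copy, 0, SEEK_CUR); /* position as seen through the copy */
    close(copy);                    /* not the last reference yet */
    fclose(tmp);                    /* last reference goes away here */
    return pos == 4;
}
```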

int (*fsync) (struct file *, struct dentry *, int);

This method is the back end of the fsync system call, which a user calls to flush any pending data. If this pointer is NULL, the system call returns -EINVAL.

int (*aio_fsync)(struct kiocb *, int);

This is the asynchronous version of the fsync method.

int (*fasync) (int, struct file *, int);

This operation is used to notify the device of a change in its FASYNC flag. Asynchronous notification is an advanced topic and is described in Chapter 6. The field can be NULL if the driver doesn't support asynchronous notification.

int (*lock) (struct file *, int, struct file_lock *);

The lock method is used to implement file locking; locking is an indispensable feature for regular files but is almost never implemented by device drivers.

ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);

ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);

These methods implement scatter/gather read and write operations. Applications occasionally need to do a single read or write operation involving multiple memory areas; these system calls allow them to do so without forcing extra copy operations on the data. If these function pointers are left NULL, the read and write methods are called (perhaps more than once) instead.
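
For reference, this is what a gather write looks like from the application side. It is a user-space sketch assuming a POSIX system, with a pipe standing in for a device; gather_through_pipe is a hypothetical helper name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Write two separate buffers with a single writev() call, then read the
 * result back into `out`. Returns the number of bytes read, -1 on error. */
ssize_t gather_through_pipe(char *out, size_t outlen)
{
    int fds[2];
    char head[] = "hello, ";
    char tail[] = "world";
    struct iovec iov[2] = {
        { .iov_base = head, .iov_len = strlen(head) },
        { .iov_base = tail, .iov_len = strlen(tail) },
    };
    ssize_t expected = (ssize_t)(strlen(head) + strlen(tail));
    ssize_t n;

    assert(out != NULL);
    if (pipe(fds) != 0)
        return -1;
    if (writev(fds[1], iov, 2) != expected) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    n = read(fds[0], out, outlen);  /* both pieces arrive as one stream */
    close(fds[0]);
    close(fds[1]);
    return n;
}
```

The caller never has to copy the two pieces into one contiguous buffer first, which is the point of the scatter/gather calls.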

ssize_t (*sendfile)(struct file *, loff_t *, size_t, read_actor_t, void *);

This method implements the read side of the sendfile system call, which moves the data from one file descriptor to another with a minimum of copying. It is used, for example, by a web server that needs to send the contents of a file out a network connection. Device drivers usually leave sendfile NULL.

ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);

sendpage is the other half of sendfile; it is called by the kernel to send data, one page at a time, to the corresponding file. Device drivers do not usually implement sendpage.

unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);

The purpose of this method is to find a suitable location in the process's address space to map in a memory segment on the underlying device. This task is normally performed by the memory management code; this method exists to allow drivers to enforce any alignment requirements a particular device may have. Most drivers can leave this method NULL.

int (*check_flags)(int);

This method allows a module to check the flags passed to an fcntl(F_SETFL…) call.

int (*dir_notify)(struct file *, unsigned long);

This method is invoked when an application uses fcntl to request directory change notifications. It is useful only to filesystems; drivers need not implement dir_notify.

The scull device driver implements only the most important device methods. Its file_operations structure is initialized as follows:

struct file_operations scull_fops = {
    .owner   = THIS_MODULE,
    .llseek  = scull_llseek,
    .read    = scull_read,
    .write   = scull_write,
    .ioctl   = scull_ioctl,
    .open    = scull_open,
    .release = scull_release,
};

The file Structure

struct file, defined in <linux/fs.h>, is the second most important data structure used in device drivers. Note that a file has nothing to do with the FILE pointers of user-space programs. A FILE is defined in the C library and never appears in kernel code. A struct file, on the other hand, is a kernel structure that never appears in user programs.

mode_t f_mode;

The file mode identifies the file as either readable or writable (or both), by means of the bits FMODE_READ and FMODE_WRITE. You might want to check this field for read/write permission in your open or ioctl function, but you don't need to check permissions for read and write, because the kernel checks before invoking your method. An attempt to read or write when the file has not been opened for that type of access is rejected without the driver even knowing about it.
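
The claim that the kernel rejects the access before the driver is involved can be checked from user space. A minimal sketch, assuming a Linux system where /dev/null exists; write_on_readonly_errno is a name made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Open `path` read-only and attempt a write. Returns the errno reported
 * by the kernel, 0 if the write unexpectedly succeeded, -1 if the open
 * itself failed. */
int write_on_readonly_errno(const char *path)
{
    int fd, err = 0;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (write(fd, "x", 1) < 0)
        err = errno;        /* the driver's write method is never called */
    close(fd);
    return err;
}
```

On Linux the reported error is EBADF, because the descriptor was not opened for writing.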

loff_t f_pos;

The current reading or writing position. loff_t is a 64-bit value on all platforms (long long in gcc terminology). The driver can read this value if it needs to know the current position in the file but should not normally change it; read and write should update a position using the pointer they receive as the last argument instead of acting on filp->f_pos directly. The one exception to this rule is in the llseek method, the purpose of which is to change the file position.

unsigned int f_flags;

These are the file flags, such as O_RDONLY, O_NONBLOCK, and O_SYNC. A driver should check the O_NONBLOCK flag to see if nonblocking operation has been requested (we discuss nonblocking I/O in Section 6.2.3); the other flags are seldom used. In particular, read/write permission should be checked using f_mode rather than f_flags. All the flags are defined in the header <linux/fcntl.h>.

struct file_operations *f_op;

The operations associated with the file. The kernel assigns the pointer as part of its implementation of open and then reads it when it needs to dispatch any operations. The value in filp->f_op is never saved by the kernel for later reference; this means that you can change the file operations associated with your file, and the new methods will be effective after you return to the caller. For example, the code for open associated with major number 1 (/dev/null, /dev/zero, and so on) substitutes the operations in filp->f_op depending on the minor number being opened. This practice allows the implementation of several behaviors under the same major number without introducing overhead at each system call. The ability to replace the file operations is the kernel equivalent of "method overriding" in object-oriented programming.
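
The substitution of f_op inside open can be modelled in plain C. The sketch below is a user-space analogy, not kernel code: struct toy_file, struct toy_ops and every function name are invented for illustration, and only the idea (one shared open method that installs the real operations according to the minor number) comes from the text. Minors 3 and 5 are used because those are the minor numbers of /dev/null and /dev/zero.

```c
#include <assert.h>
#include <stddef.h>

struct toy_file;

/* Toy stand-in for struct file_operations. */
struct toy_ops {
    int  (*open)(struct toy_file *f, unsigned int minor);
    long (*read)(struct toy_file *f, char *buf, size_t len);
};

/* Toy stand-in for struct file. */
struct toy_file {
    const struct toy_ops *f_op;
};

/* "null" behaviour: every read reports end-of-file. */
static long null_read(struct toy_file *f, char *buf, size_t len)
{
    (void)f; (void)buf; (void)len;
    return 0;
}

/* "zero" behaviour: every read fills the buffer with zero bytes. */
static long zero_read(struct toy_file *f, char *buf, size_t len)
{
    size_t i;

    (void)f;
    for (i = 0; i < len; i++)
        buf[i] = 0;
    return (long)len;
}

static const struct toy_ops null_ops = { .read = null_read };
static const struct toy_ops zero_ops = { .read = zero_read };

/* Shared open method: replaces f_op according to the minor number. */
static int mem_open(struct toy_file *f, unsigned int minor)
{
    switch (minor) {
    case 3: f->f_op = &null_ops; return 0;
    case 5: f->f_op = &zero_ops; return 0;
    default: return -1;
    }
}

static const struct toy_ops mem_ops = { .open = mem_open };

/* Open by minor number, then dispatch one read through whatever f_op
 * points at after open has returned. Returns -1 if no device matches. */
long open_and_read(unsigned int minor, char *buf, size_t len)
{
    struct toy_file f = { .f_op = &mem_ops };

    assert(buf != NULL);
    if (f.f_op->open(&f, minor) != 0)
        return -1;
    if (f.f_op->read == NULL)
        return -1;
    return f.f_op->read(&f, buf, len);
}
```

After open returns, each call goes straight to the installed method, so no per-call test on the minor number is needed.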

void *private_data;

The open system call sets this pointer to NULL before calling the open method for the driver. You are free to make your own use of the field or to ignore it; you can use the field to point to allocated data, but then you must remember to free that memory in the release method before the file structure is destroyed by the kernel. private_data is a useful resource for preserving state information across system calls and is used by most of our sample modules.

struct dentry *f_dentry;

The directory entry (dentry) structure associated with the file. Device driver writers normally need not concern themselves with dentry structures, other than to access the inode structure as filp->f_dentry->d_inode.

The inode Structure

The inode structure is used by the kernel internally to represent files. Therefore, it is different from the file structure that represents an open file descriptor. There can be numerous file structures representing multiple open descriptors on a single file, but they all point to a single inode structure.

The inode structure contains a great deal of information about the file. As a general rule, only two fields of this structure are of interest for writing driver code:

dev_t i_rdev;

For inodes that represent device files, this field contains the actual device number.

struct cdev *i_cdev;

struct cdev is the kernel's internal structure that represents char devices; this field contains a pointer to that structure when the inode refers to a char device file.

The type of i_rdev changed over the course of the 2.5 development series, breaking a lot of drivers. As a way of encouraging more portable programming, the kernel developers have added two macros that can be used to obtain the major and minor number from an inode:

unsigned int iminor(struct inode *inode);

unsigned int imajor(struct inode *inode);

The dentry Structure

<linux/dcache.h>
struct dentry {
        atomic_t d_count;
        unsigned int d_flags;           /* protected by d_lock */
        spinlock_t d_lock;              /* per dentry lock */
        int d_mounted;
        struct inode *d_inode;          /* Where the name belongs to - NULL is
                                         * negative */
        /*
         * The next three fields are touched by __d_lookup.  Place them here
         * so they all fit in a cache line.
         */
        struct hlist_node d_hash;       /* lookup hash list */
        struct dentry *d_parent;        /* parent directory */
        struct qstr d_name;

        struct list_head d_lru;         /* LRU list */
        /*
         * d_child and d_rcu can share memory
         */
        union {
                struct list_head d_child;       /* child of parent list */
                struct rcu_head d_rcu;
        } d_u;
        struct list_head d_subdirs;     /* our children */
        struct list_head d_alias;       /* inode alias list */
        unsigned long d_time;           /* used by d_revalidate */
        const struct dentry_operations *d_op;
        struct super_block *d_sb;       /* The root of the dentry tree */
        void *d_fsdata;                 /* fs-specific data */

        unsigned char d_iname[DNAME_INLINE_LEN_MIN];    /* small names */
};

The utsname structure

<sys/utsname.h> (/usr/include/sys/utsname.h)

/* Structure describing the system and machine.  */
struct utsname
  {
    /* Name of the implementation of the operating system.  */
    char sysname[_UTSNAME_SYSNAME_LENGTH];

    /* Name of this node on the network.  */
    char nodename[_UTSNAME_NODENAME_LENGTH];

    /* Current release level of this implementation.  */
    char release[_UTSNAME_RELEASE_LENGTH];
    /* Current version level of this release.  */
    char version[_UTSNAME_VERSION_LENGTH];

    /* Name of the hardware type the system is running on.  */
    char machine[_UTSNAME_MACHINE_LENGTH];

#if _UTSNAME_DOMAIN_LENGTH - 0
    /* Name of the domain of this node on the network.  */
# ifdef __USE_GNU
    char domainname[_UTSNAME_DOMAIN_LENGTH];
# else
    char __domainname[_UTSNAME_DOMAIN_LENGTH];
# endif
#endif
  };

/* Put information about the system in NAME.  */
extern int uname (struct utsname *__name) __THROW;
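
The release field is the string that `uname -r` prints, which is what the KERNELDIR line in the Make section relies on. The following user-space sketch builds the same path with uname(); kernel_build_dir is a hypothetical helper name, and the path layout is taken from that KERNELDIR line.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Write "/lib/modules/<release>/build" for the running kernel into buf.
 * Returns the length of the path, or -1 on failure or truncation. */
int kernel_build_dir(char *buf, size_t len)
{
    struct utsname u;
    int n;

    assert(buf != NULL && len > 0);
    if (uname(&u) != 0)
        return -1;
    n = snprintf(buf, len, "/lib/modules/%s/build", u.release);
    if (n < 0 || (size_t)n >= len)
        return -1;
    return n;
}
```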

Files

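/proc/modules

/proc/modules is the older, single-file version of the information found under /sys/module. Each entry contains the module name, the amount of memory the module occupies, and its usage count. Extra strings are appended to each line to specify flags that are currently active for the module.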

binfmt_misc 6587 1 - Live 0xf82b0000
ppdev 5259 0 - Live 0xf8274000
vboxnetadp 6390 0 - Live 0xf82ac000
vboxnetflt 12740 0 - Live 0xf826e000
vboxdrv 169169 2 vboxnetadp,vboxnetflt, Live 0xf8626000
nfsd 238778 13 - Live 0xf870d000
exportfs 3437 1 nfsd, Live 0xf8308000
nfs 265631 0 - Live 0xf8683000
lockd 64881 2 nfsd,nfs, Live 0xf8614000
nfs_acl 2245 2 nfsd,nfs, Live 0xf82a9000
auth_rpcgss 33767 2 nfsd,nfs, Live 0xf8299000
sunrpc 193609 12 nfsd,nfs,lockd,nfs_acl,auth_rpcgss, Live 0xf85e2000
snd_hda_codec_realtek 203472 1 - Live 0xf9d34000
snd_usb_audio 75861 2 - Live 0xf9cce000
snd_usb_lib 15833 1 snd_usb_audio, Live 0xf9ca7000
snd_hda_intel 22165 4 - Live 0xf9c79000
snd_pcm_oss 35308 0 - Live 0xf9c5c000
snd_hda_codec 74297 2 snd_hda_codec_realtek,snd_hda_intel, Live 0xf9c2e000
snd_mixer_oss 13746 1 snd_pcm_oss, Live 0xf9c09000
snd_pcm 70918 5 snd_usb_audio,snd_hda_intel,snd_pcm_oss,snd_hda_codec, Live 0xf9be5000
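
The fields of one such line can be pulled apart with sscanf. This is a user-space sketch that assumes the six-column layout of the listing above (name, size, usage count, dependents, state, address); struct mod_entry and parse_module_line are names made up here.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record for one /proc/modules line. */
struct mod_entry {
    char name[64];
    unsigned long size;     /* memory occupied by the module */
    int refcount;           /* usage count */
    char state[16];         /* e.g. "Live" */
};

/* Parse one line in the format shown above. Returns 0 on success and
 * -1 if the line does not have the expected leading fields. */
int parse_module_line(const char *line, struct mod_entry *out)
{
    char users[256];        /* dependents list, "-" when empty */

    assert(line != NULL && out != NULL);
    if (sscanf(line, "%63s %lu %d %255s %15s",
               out->name, &out->size, &out->refcount,
               users, out->state) != 5)
        return -1;
    return 0;
}
```

A dependents list longer than 255 characters would be split by this sketch; a more careful parser would tokenize the line instead.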

/proc/devices

Character devices:
  1 mem
  4 /dev/vc/0
  4 tty
  4 ttyS
  5 /dev/tty
  5 /dev/console
  5 /dev/ptmx
  6 lp
  7 vcs
 10 misc
 13 input
 14 sound
 21 sg
 29 fb
 99 ppdev
108 ppp
116 alsa
Block devices:
  1 ramdisk
259 blkext
  7 loop
  8 sd
  9 md
 11 sr
 65 sd
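
A dynamically assigned major number has to be looked up in this file before the device node can be created. The sketch below does that lookup in user-space C; it assumes the two-section format of the listing above, and find_char_major is a name invented for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Scan text in /proc/devices format and return the major number listed
 * for `name` in the "Character devices:" section, or -1 if absent. */
int find_char_major(FILE *fp, const char *name)
{
    char line[128], dev[64];
    int major, in_char = 0;

    assert(fp != NULL && name != NULL);
    while (fgets(line, sizeof line, fp) != NULL) {
        if (strncmp(line, "Character devices:", 18) == 0) {
            in_char = 1;
            continue;
        }
        if (strncmp(line, "Block devices:", 14) == 0) {
            in_char = 0;
            continue;
        }
        if (in_char && sscanf(line, "%d %63s", &major, dev) == 2 &&
            strcmp(dev, name) == 0)
            return major;
    }
    return -1;
}
```

Passing a stream opened on /proc/devices gives the value to use as MAJNUM in a mknod command.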

/sys/module

/sys/module is a sysfs directory hierarchy containing information on the currently loaded modules.

/sys/module
|-- 8250
|   `-- parameters
|       |-- nr_uarts
|       |-- probe_rsa
|       |-- share_irqs
|       `-- skip_txen_test
|-- acpi
|   `-- parameters
|       |-- acpica_version
|       |-- bfs
|       |-- gts
|       `-- immediate_undock
|-- acpi_cpufreq
|   `-- parameters
|       `-- acpi_pstate_strict
|-- agpgart
|   |-- holders
|   |   `-- nvidia -> ../../nvidia
|   |-- initstate
|   |-- notes
|   |-- refcnt
|   |-- sections
|   |   |-- __kcrctab
|   |   |-- __kcrctab_gpl
|   |   |-- __ksymtab
|   |   |-- __ksymtab_gpl
|   |   |-- __ksymtab_strings
|   |   `-- __mcount_loc
|   `-- srcversion
.....

Test

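sudo insmod xxx.ko   or   sudo modprobe xxx   (modprobe takes the module name, not the .ko path)
sudo mknod -m og+rw /dev/XXX c MAJNUM 0   (MAJNUM is the major number listed in /proc/devices)
or sudo chmod NNN /dev/XXX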

misc-modules

`hello.c`

1) MODULE_LICENSE("Dual BSD/GPL");
2) printk(KERN_ALERT "Hello, world\n");
#include <linux/kernel.h>
int printk(const char * fmt, ...);
    The kernel-code analogue of printf.

3) module_init(hello_init);
module_exit(hello_exit);

Check the output message

dmesg | tail

Ref

module_init and module_exit

chapter 2

#include <linux/init.h>

#ifndef MODULE
/**
 * module_init() - driver initialization entry point
 * @x: function to be run at kernel boot time or module insertion
 * 
 * module_init() will either be called during do_initcalls() (if
 * builtin) or at module insertion time (if a module).  There can only
 * be one per module.
 */
#define module_init(x)  __initcall(x);

/**
 * module_exit() - driver exit entry point
 * @x: function to be run when driver is removed
 * 
 * module_exit() will wrap the driver clean-up code
 * with cleanup_module() when used with rmmod when
 * the driver is a module.  If the driver is statically
 * compiled into the kernel, module_exit() has no effect.
 * There can only be one per module.
 */
#define module_exit(x)  __exitcall(x);

#else /* MODULE */

/* Each module must use one module_init(). */
#define module_init(initfn)                                     \
        static inline initcall_t __inittest(void)               \
        { return initfn; }                                      \
        int init_module(void) __attribute__((alias(#initfn)));

/* This is only required if you want to be unloadable. */
#define module_exit(exitfn)                                     \
        static inline exitcall_t __exittest(void)               \
        { return exitfn; }                                      \
        void cleanup_module(void) __attribute__((alias(#exitfn)));
#endif

printk

chapter 2

#include <linux/printk.h>

#define KERN_EMERG      "<0>"   /* system is unusable                   */
#define KERN_ALERT      "<1>"   /* action must be taken immediately     */
#define KERN_CRIT       "<2>"   /* critical conditions                  */
#define KERN_ERR        "<3>"   /* error conditions                     */
#define KERN_WARNING    "<4>"   /* warning conditions                   */
#define KERN_NOTICE     "<5>"   /* normal but significant condition     */
#define KERN_INFO       "<6>"   /* informational                        */
#define KERN_DEBUG      "<7>"   /* debug-level messages                 */

#ifdef CONFIG_PRINTK
asmlinkage int printk(const char * fmt, ...)
        __attribute__ ((format (printf, 1, 2))) __cold;
#else
static inline int printk(const char *s, ...)
        __attribute__ ((format (printf, 1, 2)));
#endif
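
In the header version quoted above, each KERN_* level is just a string literal, so writing KERN_ALERT "Hello" relies on compile-time string concatenation and yields "<1>Hello". The user-space sketch below shows that mechanism; the DEMO_ macros and demo_loglevel are made-up names, not kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Copies of two of the prefixes quoted above, renamed to stay out of
 * the kernel's namespace. */
#define DEMO_KERN_ALERT "<1>"
#define DEMO_KERN_INFO  "<6>"

/* Return the level encoded in a leading "<N>" prefix (0..7), or -1 if
 * the message has no such prefix. */
int demo_loglevel(const char *msg)
{
    assert(msg != NULL);
    if (msg[0] == '<' && msg[1] >= '0' && msg[1] <= '7' && msg[2] == '>')
        return msg[1] - '0';
    return -1;
}
```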

MODULE_LICENSE("Dual BSD/GPL");

chapter 2

#include <linux/module.h>

/* Generic info of form tag = "info" */
#define MODULE_INFO(tag, info) __MODULE_INFO(tag, tag, info)

/*
 * The following license idents are currently accepted as indicating free
 * software modules
 *
 *      "GPL"                           [GNU Public License v2 or later]
 *      "GPL v2"                        [GNU Public License v2]
 *      "GPL and additional rights"     [GNU Public License v2 rights and more]
 *      "Dual BSD/GPL"                  [GNU Public License v2
 *                                       or BSD license choice]
 *      "Dual MIT/GPL"                  [GNU Public License v2
 *                                       or MIT license choice]
 *      "Dual MPL/GPL"                  [GNU Public License v2
 *                                       or Mozilla license choice]
 *
 * The following other idents are available
 *
 *      "Proprietary"                   [Non free products]
 *
 * There are dual licensed components, but when running with Linux it is the
 * GPL that is relevant so this is a non issue. Similarly LGPL linked with GPL
 * is a GPL combined work.
 *
 * This exists for several reasons
 * 1.   So modinfo can show license info for users wanting to vet their setup 
 *      is free
 * 2.   So the community can ignore bug reports including proprietary modules
 * 3.   So vendors can do likewise based on their own policies
 */
#define MODULE_LICENSE(_license) MODULE_INFO(license, _license)

#include <linux/moduleparam.h>

#ifdef MODULE
#define ___module_cat(a,b) __mod_ ## a ## b
#define __module_cat(a,b) ___module_cat(a,b)
#define __MODULE_INFO(tag, name, info)                                    \
static const char __module_cat(name,__LINE__)[]                           \
  __used __attribute__((section(".modinfo"), unused, aligned(1)))         \
  = __stringify(tag) "=" info
#else  /* !MODULE */
#define __MODULE_INFO(tag, name, info)
#endif

`hellop.c`

SRC

static char *whom = "world";
static int howmany = 1;
module_param(howmany, int, S_IRUGO);
module_param(whom, charp, S_IRUGO);

module_param(name, type, perm)

chapter 2

#include <linux/moduleparam.h>

/**
 * module_param - typesafe helper for a module/cmdline parameter
 * @value: the variable to alter, and exposed parameter name.
 * @type: the type of the parameter
 * @perm: visibility in sysfs.
 *
 * @value becomes the module parameter, or (prefixed by KBUILD_MODNAME and a
 * ".") the kernel commandline parameter.  Note that - is changed to _, so
 * the user can use "foo-bar=1" even for variable "foo_bar".
 *
 * @perm is 0 if the the variable is not to appear in sysfs, or 0444
 * for world-readable, 0644 for root-writable, etc.  Note that if it
 * is writable, you may need to use kparam_block_sysfs_write() around
 * accesses (esp. charp, which can be kfreed when it changes).
 *
 * The @type is simply pasted to refer to a param_ops_##type and a
 * param_check_##type: for convenience many standard types are provided but
 * you can create your own by defining those variables.
 *
 * Standard types are:
 *      byte, short, ushort, int, uint, long, ulong
 *      charp: a character pointer
 *      bool: a bool, values 0/1, y/n, Y/N.
 *      invbool: the above, only sense-reversed (N = true).
 */
#define module_param(name, type, perm)                          \
        module_param_named(name, name, type, perm)

/**
 * module_param_named - typesafe helper for a renamed module/cmdline parameter
 * @name: a valid C identifier which is the parameter name.
 * @value: the actual lvalue to alter.
 * @type: the type of the parameter
 * @perm: visibility in sysfs.
 *
 * Usually it's a good idea to have variable names and user-exposed names the
 * same, but that's harder if the variable must be non-static or is inside a
 * structure.  This allows exposure under a different name.
 */
#define module_param_named(name, value, type, perm)                        \
        param_check_##type(name, &(value));                                \
        module_param_cb(name, &param_ops_##type, &value, perm);            \
        __MODULE_PARM_TYPE(name, #type)

/**
 * module_param_cb - general callback for a module/cmdline parameter
 * @name: a valid C identifier which is the parameter name.
 * @ops: the set & get operations for this parameter.
 * @perm: visibility in sysfs.
 *
 * The ops can have NULL set or get functions.
 */
#define module_param_cb(name, ops, arg, perm)                                 \
        __module_param_call(MODULE_PARAM_PREFIX,                              \
                            name, ops, arg, __same_type((arg), bool *), perm)

/* This is the fundamental function for registering boot/module
   parameters. */
#define __module_param_call(prefix, name, ops, arg, isbool, perm)       \
        /* Default value instead of permissions? */                     \
        static int __param_perm_check_##name __attribute__((unused)) =  \
        BUILD_BUG_ON_ZERO((perm) < 0 || (perm) > 0777 || ((perm) & 2))  \
        + BUILD_BUG_ON_ZERO(sizeof(""prefix) > MAX_PARAM_PREFIX_LEN);   \
        static const char __param_str_##name[] = prefix #name;          \
        static struct kernel_param __moduleparam_const __param_##name   \
        __used                                                          \
    __attribute__ ((unused,__section__ ("__param"),aligned(sizeof(void *)))) \
        = { __param_str_##name, ops, perm, isbool ? KPARAM_ISBOOL : 0,  \
            { arg } }

S_IRUGO

chapter 2

<linux/stat.h>

#if defined(__KERNEL__) || !defined(__GLIBC__) || (__GLIBC__ < 2)
#define S_IRWXU 00700
#define S_IRUSR 00400
#define S_IWUSR 00200
#define S_IXUSR 00100

#define S_IRWXG 00070
#define S_IRGRP 00040
#define S_IWGRP 00020
#define S_IXGRP 00010

#define S_IRWXO 00007
#define S_IROTH 00004
#define S_IWOTH 00002
#define S_IXOTH 00001

#endif


#ifdef __KERNEL__
#define S_IRWXUGO       (S_IRWXU|S_IRWXG|S_IRWXO)
#define S_IALLUGO       (S_ISUID|S_ISGID|S_ISVTX|S_IRWXUGO)
#define S_IRUGO         (S_IRUSR|S_IRGRP|S_IROTH)
#define S_IWUGO         (S_IWUSR|S_IWGRP|S_IWOTH)
#define S_IXUGO         (S_IXUSR|S_IXGRP|S_IXOTH)

#define UTIME_NOW       ((1l << 30) - 1l)
#define UTIME_OMIT      ((1l << 30) - 2l)
#endif
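
S_IRUGO is only visible inside the kernel, but its value can be rebuilt from the user-space bits to confirm that it is 0444. The sketch also mirrors the permission check quoted in __module_param_call, which rejects world-writable parameters. DEMO_S_IRUGO and demo_param_perm_ok are names invented for this illustration.

```c
#include <assert.h>
#include <sys/stat.h>

/* User-space reconstruction of the kernel-only S_IRUGO. */
#define DEMO_S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)

/* Mirror of the build-time check in __module_param_call: the value must
 * lie in 0..0777 and must not have the "writable by others" bit set.
 * Returns 1 if the permission value would be accepted. */
int demo_param_perm_ok(int perm)
{
    return !(perm < 0 || perm > 0777 || (perm & 2));
}

/* Self-check of the values discussed in the text. */
void demo_perm_selfcheck(void)
{
    assert(DEMO_S_IRUGO == 0444);
    assert(demo_param_perm_ok(DEMO_S_IRUGO));
}
```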

Run

sudo insmod hellop.ko  howmany=10 whom="what"

`complete.c`

struct task_struct *current;

chapter 2.3 Kernel code can refer to the current process by accessing the global item current, defined in <asm/current.h>, which yields a pointer to struct task_struct, defined by <linux/sched.h>.

snippet

current->pid
current->comm
The process ID and the command name of the current process.

files

<linux/sched.h>
#include <asm/current.h>

 <asm/current.h>
#include <linux/thread_info.h>

 static inline struct task_struct *get_current(void) __attribute_const__;
static inline struct task_struct *get_current(void)
{
        return current_thread_info()->task;
}

#define current (get_current())

<linux/thread_info.h>
static inline struct thread_info *current_thread_info(void) __attribute_const__;

static inline struct thread_info *current_thread_info(void)
{
        register unsigned long sp asm ("sp");
        return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
}
#define THREAD_SIZE             8192

/*
 * low level task data that entry.S needs immediate access to.
 * __switch_to() assumes cpu_context follows immediately after cpu_domain.
 */
struct thread_info {
        unsigned long           flags;          /* low level flags */
        int                     preempt_count;  /* 0 => preemptable, <0 => bug */
        mm_segment_t            addr_limit;     /* address limit */
        struct task_struct      *task;          /* main task structure */
        struct exec_domain      *exec_domain;   /* execution domain */
        __u32                   cpu;            /* cpu */
        __u32                   cpu_domain;     /* cpu domain */
        struct cpu_context_save cpu_context;    /* cpu context */
        __u32                   syscall;        /* syscall number */
        __u8                    used_cp[16];    /* thread used copro */
        unsigned long           tp_value;
        struct crunch_state     crunchstate;
        union fp_state          fpstate __attribute__((aligned(8)));
        union vfp_state         vfpstate;
#ifdef CONFIG_ARM_THUMBEE
        unsigned long           thumbee_state;  /* ThumbEE Handler Base register */
#endif
        struct restart_block    restart_block;
};
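
The address arithmetic in current_thread_info() can be tried out in user space. In the layout quoted above, thread_info sits at the bottom of a THREAD_SIZE-aligned kernel stack, so clearing the low bits of the stack pointer gives its address. This sketch only reproduces the masking step; DEMO_THREAD_SIZE and demo_thread_info_base are hypothetical names and the addresses are arbitrary examples.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_THREAD_SIZE 8192u

/* Round a stack address down to the start of its THREAD_SIZE-aligned
 * block, as sp & ~(THREAD_SIZE - 1) does in the header above. */
uintptr_t demo_thread_info_base(uintptr_t sp)
{
    /* The mask trick is only valid for power-of-two sizes. */
    assert((DEMO_THREAD_SIZE & (DEMO_THREAD_SIZE - 1)) == 0);
    return sp & ~((uintptr_t)DEMO_THREAD_SIZE - 1);
}
```

For example, any address from 0xc0122000 through 0xc0123fff maps to 0xc0122000.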

dev_t

chapter 3

#include <linux/types.h>
#ifdef __KERNEL__

typedef __u32 __kernel_dev_t;
typedef __kernel_dev_t          dev_t;

operation

  #include <linux/kdev_t.h>
#ifdef __KERNEL__
#define MINORBITS       20
#define MINORMASK       ((1U << MINORBITS) - 1)

#define MAJOR(dev)      ((unsigned int) ((dev) >> MINORBITS))
#define MINOR(dev)      ((unsigned int) ((dev) & MINORMASK))
#define MKDEV(ma,mi)    (((ma) << MINORBITS) | (mi))

#define print_dev_t(buffer, dev)                                        \
        sprintf((buffer), "%u:%u\n", MAJOR(dev), MINOR(dev))

#define format_dev_t(buffer, dev)                                       \
        ({                                                              \
                sprintf(buffer, "%u:%u", MAJOR(dev), MINOR(dev));       \
                buffer;                                                 \
        })
static inline u32 new_encode_dev(dev_t dev)
{
        unsigned major = MAJOR(dev);
        unsigned minor = MINOR(dev);
        return (minor & 0xff) | (major << 8) | ((minor & ~0xff) << 12);
}

static inline dev_t new_decode_dev(u32 dev)
{
        unsigned major = (dev & 0xfff00) >> 8;
        unsigned minor = (dev & 0xff) | ((dev >> 12) & 0xfff00);
        return MKDEV(major, minor);
}

#else /* __KERNEL__ */
/*
Some programs want their definitions of MAJOR and MINOR and MKDEV
from the kernel sources. These must be the externally visible ones.
*/
#define MAJOR(dev)      ((dev)>>8)
#define MINOR(dev)      ((dev) & 0xff)
#define MKDEV(ma,mi)    ((ma)<<8 | (mi))
#endif /* __KERNEL__ */
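
To see the 12-bit major / 20-bit minor packing at work, the in-kernel macros can be copied into a user-space program. The sketch below does that with a DEMO_ prefix (the prefix and the function names are assumptions made to avoid clashing with system headers) and checks that new_encode_dev and new_decode_dev undo each other.

```c
#include <assert.h>
#include <stdint.h>

/* User-space copies of the in-kernel macros quoted above. */
#define DEMO_MINORBITS   20
#define DEMO_MINORMASK   ((1U << DEMO_MINORBITS) - 1)
#define DEMO_MAJOR(dev)  ((unsigned int)((dev) >> DEMO_MINORBITS))
#define DEMO_MINOR(dev)  ((unsigned int)((dev) & DEMO_MINORMASK))
#define DEMO_MKDEV(ma, mi) (((uint32_t)(ma) << DEMO_MINORBITS) | (mi))

/* Same bit shuffling as new_encode_dev(): low 8 minor bits, then the
 * major, then the remaining minor bits. */
uint32_t demo_new_encode_dev(uint32_t dev)
{
    unsigned int major = DEMO_MAJOR(dev);
    unsigned int minor = DEMO_MINOR(dev);

    return (minor & 0xff) | (major << 8) | ((minor & ~0xffu) << 12);
}

/* Inverse of demo_new_encode_dev(). */
uint32_t demo_new_decode_dev(uint32_t dev)
{
    unsigned int major = (dev & 0xfff00) >> 8;
    unsigned int minor = (dev & 0xff) | ((dev >> 12) & 0xfff00);

    return DEMO_MKDEV(major, minor);
}

/* Self-check: packing, unpacking and the encode/decode pair agree. */
void demo_dev_selfcheck(void)
{
    uint32_t dev = DEMO_MKDEV(254, 3);

    assert(DEMO_MAJOR(dev) == 254);
    assert(DEMO_MINOR(dev) == 3);
    assert(demo_new_decode_dev(demo_new_encode_dev(dev)) == dev);
}
```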

Allocating and Freeing Device Numbers
chapter 3
```
#include <linux/fs.h>
The “filesystem” header is the header required for writing device drivers. Many
important functions and data structures are declared in here.

int register_chrdev_region(dev_t first, unsigned int count, char *name);
int alloc_chrdev_region(dev_t *dev, unsigned int firstminor, unsigned int count, char *name);
void unregister_chrdev_region(dev_t first, unsigned int count);
Functions that allow a driver to allocate and free ranges of device numbers.
register_chrdev_region should be used when the desired major number is known
in advance; for dynamic allocation, use alloc_chrdev_region instead.
```
- how
  
  int register_chrdev_region(dev_t first, unsigned int count, char *name);
  
  Here, first is the beginning device number of the range you would like to allocate. The minor number portion of first is often 0, but there is no requirement to that effect. count is the total number of contiguous device numbers you are requesting. Note that, if count is large, the range you request could spill over to the next major number; but everything will still work properly as long as the number range you request is available. Finally, name is the name of the device that should be associated with this number range; it will appear in /proc/devices and sysfs.
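
The remark about a large count spilling over into the next major number follows directly from how device numbers are packed. The user-space sketch below works through one case; the DEMO_ macros are local copies of the kernel's 20-bit-minor layout, and demo_range_last and demo_range_spills are names made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MINORBITS   20
#define DEMO_MINORMASK   ((1U << DEMO_MINORBITS) - 1)
#define DEMO_MAJOR(dev)  ((unsigned int)((dev) >> DEMO_MINORBITS))
#define DEMO_MINOR(dev)  ((unsigned int)((dev) & DEMO_MINORMASK))
#define DEMO_MKDEV(ma, mi) (((uint32_t)(ma) << DEMO_MINORBITS) | (mi))

/* Last device number of a range of `count` contiguous numbers that
 * starts at `first`. */
uint32_t demo_range_last(uint32_t first, unsigned int count)
{
    assert(count > 0);
    return first + (count - 1);
}

/* Returns 1 if the range crosses into a different major number. */
int demo_range_spills(uint32_t first, unsigned int count)
{
    return DEMO_MAJOR(demo_range_last(first, count)) != DEMO_MAJOR(first);
}
```

Starting at major 250 with minor 0xFFFFE and asking for four numbers, the range ends at major 251, minor 1.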

Char Device Registration

chapter 3

API

  #include <linux/cdev.h>
  struct cdev {
        struct kobject kobj;
        struct module *owner;
        const struct file_operations *ops;
        struct list_head list;
        dev_t dev;
        unsigned int count;
};

void cdev_init(struct cdev *, const struct file_operations *);
struct cdev *cdev_alloc(void);
int cdev_add(struct cdev *, dev_t, unsigned);
void cdev_del(struct cdev *);

void cdev_put(struct cdev *p);
int cdev_index(struct inode *inode);
void cd_forget(struct inode *);
extern struct backing_dev_info directly_mappable_cdev_bdi;

examples

There are two ways of allocating and initializing one of these structures. If you wish to obtain a standalone cdev structure at runtime, you may do so with code such as:

struct cdev *my_cdev = cdev_alloc( );
my_cdev->ops = &my_fops;

Chances are, however, that you will want to embed the cdev structure within a device-specific structure of your own; that is what scull does. In that case, you should initialize the structure that you have already allocated with:

static void scull_setup_cdev(struct scull_dev *dev, int index)
{
  int err, devno = MKDEV(scull_major, scull_minor + index);

  cdev_init(&dev->cdev, &scull_fops);   /* also sets dev->cdev.ops */
  dev->cdev.owner = THIS_MODULE;
  dev->cdev.ops = &scull_fops;
  /* The device is live as soon as cdev_add returns successfully */
  err = cdev_add (&dev->cdev, devno, 1);
  /* Fail gracefully if need be */
  if (err)
    printk(KERN_NOTICE "Error %d adding scull%d\n", err, index);
}

how
- int cdev_add(struct cdev *dev, dev_t num, unsigned int count); Here, dev is the cdev structure, num is the first device number to which this device responds, and count is the number of device numbers that should be associated with the device. Often count is one, but there are situations where it makes sense to have more than one device number correspond to a specific device.
  There are a couple of important things to keep in mind when using cdev_add. The first is that this call can fail. If it returns a negative error code, your device has not been added to the system.

Data Structures

struct file_operations

The file_operations structure holds a char driver’s methods.

     #include <linux/fs.h>
/*
 * NOTE:
 * all file operations except setlease can be called without
 * the big kernel lock held in all filesystems.
 */
struct file_operations {
        struct module *owner;
        loff_t (*llseek) (struct file *, loff_t, int);
        ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
        ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
        ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
        ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
        int (*readdir) (struct file *, void *, filldir_t);
        unsigned int (*poll) (struct file *, struct poll_table_struct *);
        long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
        long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
        int (*mmap) (struct file *, struct vm_area_struct *);
        int (*open) (struct inode *, struct file *);
        int (*flush) (struct file *, fl_owner_t id);
        int (*release) (struct inode *, struct file *);
        int (*fsync) (struct file *, int datasync);
        int (*aio_fsync) (struct kiocb *, int datasync);
        int (*fasync) (int, struct file *, int);
        int (*lock) (struct file *, int, struct file_lock *);
        ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
        unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
        int (*check_flags)(int);
        int (*flock) (struct file *, int, struct file_lock *);
        ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
        ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
        int (*setlease)(struct file *, long, struct file_lock **);
};

struct file

struct file represents an open file.

struct file {
        /*
         * fu_list becomes invalid after file_free is called and queued via
         * fu_rcuhead for RCU freeing
         */
        union {
                struct list_head        fu_list;
                struct rcu_head         fu_rcuhead;
        } f_u;
        struct path             f_path;
#define f_dentry        f_path.dentry
#define f_vfsmnt        f_path.mnt
        const struct file_operations    *f_op;
        spinlock_t              f_lock;  /* f_ep_links, f_flags, no IRQ */
#ifdef CONFIG_SMP
        int                     f_sb_list_cpu;
#endif
        atomic_long_t           f_count;
        unsigned int            f_flags;
        fmode_t                 f_mode;
        loff_t                  f_pos;
        struct fown_struct      f_owner;
        const struct cred       *f_cred;
        struct file_ra_state    f_ra;

        u64                     f_version;
#ifdef CONFIG_SECURITY
        void                    *f_security;
#endif
        /* needed for tty driver, and maybe others */
        void                    *private_data;

#ifdef CONFIG_EPOLL
        /* Used by fs/eventpoll.c to link all the hooks to this file */
        struct list_head        f_ep_links;
#endif /* #ifdef CONFIG_EPOLL */
        struct address_space    *f_mapping;
#ifdef CONFIG_DEBUG_WRITECOUNT
        unsigned long f_mnt_write_state;
#endif
};

struct inode

struct inode represents a file on disk.

struct inode {
        struct hlist_node       i_hash;
        struct list_head        i_wb_list;      /* backing dev IO list */
        struct list_head        i_lru;          /* inode LRU list */
        struct list_head        i_sb_list;
        struct list_head        i_dentry;
        unsigned long           i_ino;
        atomic_t                i_count;
        unsigned int            i_nlink;
        uid_t                   i_uid;
        gid_t                   i_gid;
        dev_t                   i_rdev;
        unsigned int            i_blkbits;
        u64                     i_version;
        loff_t                  i_size;
#ifdef __NEED_I_SIZE_ORDERED
        seqcount_t              i_size_seqcount;
#endif
        struct timespec         i_atime;
        struct timespec         i_mtime;
        struct timespec         i_ctime;
        blkcnt_t                i_blocks;
        unsigned short          i_bytes;
        umode_t                 i_mode;
        spinlock_t              i_lock; /* i_blocks, i_bytes, maybe i_size */
        struct mutex            i_mutex;
        struct rw_semaphore     i_alloc_sem;
        const struct inode_operations   *i_op;
        const struct file_operations    *i_fop; /* former ->i_op->default_file_ops */
        struct super_block      *i_sb;
        struct file_lock        *i_flock;
        struct address_space    *i_mapping;
        struct address_space    i_data;
#ifdef CONFIG_QUOTA
        struct dquot            *i_dquot[MAXQUOTAS];
#endif
        struct list_head        i_devices;
        union {
                struct pipe_inode_info  *i_pipe;
                struct block_device     *i_bdev;
                struct cdev             *i_cdev;
        };

        __u32                   i_generation;

#ifdef CONFIG_FSNOTIFY
        __u32                   i_fsnotify_mask; /* all events this inode cares about */
        struct hlist_head       i_fsnotify_marks;
#endif

        unsigned long           i_state;
        unsigned long           dirtied_when;   /* jiffies of first dirtying */

        unsigned int            i_flags;

#ifdef CONFIG_IMA
        /* protected by i_lock */
        unsigned int            i_readcount; /* struct files open RO */
#endif
        atomic_t                i_writecount;
#ifdef CONFIG_SECURITY
        void                    *i_security;
#endif
#ifdef CONFIG_FS_POSIX_ACL
        struct posix_acl        *i_acl;
        struct posix_acl        *i_default_acl;
#endif
        void                    *i_private; /* fs or device private pointer */
};

struct inode_operations

struct inode_operations {
        int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
        struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
        int (*link) (struct dentry *,struct inode *,struct dentry *);
        int (*unlink) (struct inode *,struct dentry *);
        int (*symlink) (struct inode *,struct dentry *,const char *);
        int (*mkdir) (struct inode *,struct dentry *,int);
        int (*rmdir) (struct inode *,struct dentry *);
        int (*mknod) (struct inode *,struct dentry *,int,dev_t);
        int (*rename) (struct inode *, struct dentry *,
                        struct inode *, struct dentry *);
        int (*readlink) (struct dentry *, char __user *,int);
        void * (*follow_link) (struct dentry *, struct nameidata *);
        void (*put_link) (struct dentry *, struct nameidata *, void *);
        void (*truncate) (struct inode *);
        int (*permission) (struct inode *, int);
        int (*check_acl)(struct inode *, int);
        int (*setattr) (struct dentry *, struct iattr *);
        int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
        int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
        ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
        ssize_t (*listxattr) (struct dentry *, char *, size_t);
        int (*removexattr) (struct dentry *, const char *);
        void (*truncate_range)(struct inode *, loff_t, loff_t);
        long (*fallocate)(struct inode *inode, int mode, loff_t offset,
                          loff_t len);
        int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
                      u64 len);
};

Completions mechanism

chapter 5

how

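#include <linux/completion.h>

/* Method 1: declare and initialize in one step */
DECLARE_COMPLETION(comp);

/* Method 2: initialize a completion at runtime */
struct completion my_completion;
init_completion(&my_completion);

/* Either way: one thread waits, another one signals */
wait_for_completion(&comp);   /* sleeps until complete() is called */
complete(&comp);              /* wakes up one waiter */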

API

#include <linux/completion.h>

/*
 * Atomic wait-for-completion handler data structures.
 * See kernel/sched.c for details.
 */

struct completion {
        unsigned int done;
        wait_queue_head_t wait;
};

#define COMPLETION_INITIALIZER(work) \
        { 0, __WAIT_QUEUE_HEAD_INITIALIZER((work).wait) }

#define COMPLETION_INITIALIZER_ONSTACK(work) \
        ({ init_completion(&work); work; })

/**
 * DECLARE_COMPLETION - declare and initialize a completion structure
 * @work:  identifier for the completion structure
 *
 * This macro declares and initializes a completion structure. Generally used
 * for static declarations. You should use the _ONSTACK variant for automatic
 * variables.
 */
#define DECLARE_COMPLETION(work) \
        struct completion work = COMPLETION_INITIALIZER(work)

/**
 * init_completion - Initialize a dynamically allocated completion
 * @x:  completion structure that is to be initialized
 *
 * This inline function will initialize a dynamically created completion
 * structure.
 */
static inline void init_completion(struct completion *x)
{
        x->done = 0;
        init_waitqueue_head(&x->wait);
}

extern void wait_for_completion(struct completion *);
extern int wait_for_completion_interruptible(struct completion *x);
extern int wait_for_completion_killable(struct completion *x);
extern unsigned long wait_for_completion_timeout(struct completion *x,
                                                   unsigned long timeout);
extern unsigned long wait_for_completion_interruptible_timeout(
                        struct completion *x, unsigned long timeout);
extern unsigned long wait_for_completion_killable_timeout(
                        struct completion *x, unsigned long timeout);
extern bool try_wait_for_completion(struct completion *x);
extern bool completion_done(struct completion *x);

extern void complete(struct completion *);
extern void complete_all(struct completion *);

/**
 * INIT_COMPLETION - reinitialize a completion structure
 * @x:  completion structure to be reinitialized
 *
 * This macro should be used to reinitialize a completion structure so it can
 * be reused. This is especially important after complete_all() is used.
 */
#define INIT_COMPLETION(x)      ((x).done = 0)
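
Why reinitialization matters after complete_all() is easier to see on a toy model of the done counter. The sketch below is single-threaded user-space code written for these notes: it keeps only the counter, drops the wait queue and the locking, and assumes that complete_all() adds a very large value to done. All model_* names are hypothetical.

```c
#include <limits.h>
#include <stdbool.h>

/* Simplified, single-threaded model of struct completion: only the `done`
 * counter is kept; the wait queue and locking are deliberately left out. */
struct model_completion {
        unsigned int done;
};

static void model_init(struct model_completion *x)     { x->done = 0; }
static void model_complete(struct model_completion *x) { x->done++; }

/* Assumption: complete_all() bumps the counter by a huge value so that
 * every present and future waiter gets through. */
static void model_complete_all(struct model_completion *x)
{
        x->done += UINT_MAX / 2;
}

/* Non-blocking wait: consumes one pending completion, if there is one. */
static bool model_try_wait(struct model_completion *x)
{
        if (!x->done)
                return false;
        x->done--;
        return true;
}
```

After model_complete() exactly one wait succeeds. After model_complete_all() waits keep succeeding until the structure is reset, which is the job INIT_COMPLETION does in the real header.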

`faulty.c`

chapter 4

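Buffer overflow

The read method copies a string into a local variable; unfortunately, the string is longer than the destination array. The resulting buffer overflow causes an oops when the function returns. Because the return instruction sends the instruction pointer to an unpredictable location, this kind of fault is hard to trace.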

char stack_buf[4];
    /* Let's try a buffer overflow */
    memset(stack_buf, 0xff, 20);

Test

#cat /dev/faulty
[1]    22139 killed     cat /dev/faulty

[178823.762627] BUG: unable to handle kernel NULL pointer dereference at 0000000b
[178823.762631] IP: [<c0214ea9>] vfs_read+0xa9/0x1a0
[178823.762637] *pdpt = 000000002d2cf001 *pde = 0000000000000000 
[178823.762640] Oops: 0000 [#1] SMP 
[178823.762642] last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/0000:01:00.1/local_cpus
[178823.762645] Modules linked in: faulty complete binfmt_misc ppdev vboxnetadp vboxnetflt vboxdrv nfsd exportfs nfs lockd nfs_acl auth_rpcgss sunrpc snd_hda_codec_realtek snd_usb_audio snd_usb_lib snd_hda_intel snd_pcm_oss snd_hda_codec snd_mixer_oss snd_pcm snd_seq_dummy snd_hwdep snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq fbcon snd_timer tileblit snd_seq_device tpm_tis font snd tpm bitblit tpm_bios softcursor soundcore psmouse snd_page_alloc nvidia(P) serio_raw agpgart vga16fb vgastate lp parport usbhid hid usb_storage ahci e1000e
[178823.762681] 
[178823.762684] Pid: 12648, comm: more Tainted: P           (2.6.32-42-generic-pae #96-Ubuntu) 5498RF4
[178823.762693] EIP: 0060:[<c0214ea9>] EFLAGS: 00010202 CPU: 5
[178823.762697] EIP is at vfs_read+0xa9/0x1a0
[178823.762703] EAX: 00000004 EBX: ffffffff ECX: 00000000 EDX: b7487000
[178823.762708] ESI: 00000004 EDI: ffffffff EBP: ffffffff ESP: f166df6c
[178823.762710]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[178823.762712] Process more (pid: 12648, ti=f166c000 task=e251d940 task.ti=f166c000)
[178823.762714] Stack:
[178823.762715]  f166df98 f166df88 c0361b19 f82f80f0 ed0f8e80 ed0f8e80 fffffff7 00000002
[178823.762719] <0> f166dfac c0215052 f166df98 00000000 00000000 00000000 00000003 09ad9c68
[178823.762723] <0> f166c000 c01096e3 00000003 b7487000 00001000 09ad9c68 00000002 bfaa34c4
[178823.762728] Call Trace:
[178823.762732]  [<c0361b19>] ? copy_to_user+0x39/0x130
[178823.762735]  [<f82f80f0>] ? faulty_read+0x0/0x50 [faulty]
[178823.762738]  [<c0215052>] ? sys_read+0x42/0x70
[178823.762741]  [<c01096e3>] ? sysenter_do_call+0x12/0x28
[178823.762743] Code: a4 8b 43 10 8b 40 08 85 c0 89 45 ec 0f 84 e1 00 00 00 8b 45 08 89 f1 89 fa 89 04 24 89 d8 ff 55 ec 89 c6 85 f6 0f 8e 9a 00 00 00 <8b> 7b 0c 31 db 8b 47 10 89 45 f0 0f b7 40 72 c7 44 24 04 00 00 
[178823.762791] EIP: [<c0214ea9>] vfs_read+0xa9/0x1a0 SS:ESP 0068:f166df6c
[178823.762794] CR2: 000000000000000b
[178823.762803] ---[ end trace 9342d36e7d9d6b0e ]---

Make a simple fault by dereferencing a NULL pointer:

*(int *)0 = 0;

Test

# echo "test" >  /dev/faulty

[179151.985610] BUG: unable to handle kernel NULL pointer dereference at (null)
[179151.985618] IP: [<f82f800a>] faulty_write+0xa/0x20 [faulty]
[179151.985627] *pdpt = 0000000021426001 *pde = 0000000000000000 
[179151.985633] Oops: 0002 [#2] SMP 
[179151.985638] last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/0000:01:00.1/local_cpus
[179151.985642] Modules linked in: faulty complete binfmt_misc ppdev vboxnetadp vboxnetflt vboxdrv nfsd exportfs nfs lockd nfs_acl auth_rpcgss sunrpc snd_hda_codec_realtek snd_usb_audio snd_usb_lib snd_hda_intel snd_pcm_oss snd_hda_codec snd_mixer_oss snd_pcm snd_seq_dummy snd_hwdep snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq fbcon snd_timer tileblit snd_seq_device tpm_tis font snd tpm bitblit tpm_bios softcursor soundcore psmouse snd_page_alloc nvidia(P) serio_raw agpgart vga16fb vgastate lp parport usbhid hid usb_storage ahci e1000e
[179151.985703] 
[179151.985708] Pid: 6614, comm: zsh Tainted: P      D    (2.6.32-42-generic-pae #96-Ubuntu) 5498RF4
[179151.985713] EIP: 0060:[<f82f800a>] EFLAGS: 00010246 CPU: 4
[179151.985718] EIP is at faulty_write+0xa/0x20 [faulty]
[179151.985721] EAX: 00000000 EBX: edaf1600 ECX: 00000005 EDX: 080d5540
[179151.985725] ESI: 00000005 EDI: 080d5540 EBP: e19edf64 ESP: e19edf64
[179151.985729]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[179151.985733] Process zsh (pid: 6614, ti=e19ec000 task=e8fda640 task.ti=e19ec000)
[179151.985736] Stack:
[179151.985738]  e19edf8c c02146f2 e19edf98 edaf1600 edba3b00 f82f8000 e19edf94 edaf1600
[179151.985747] <0> fffffff7 080d5540 e19edfac c0214fe2 e19edf98 00000000 00000000 00000000
[179151.985757] <0> 00000001 00000005 e19ec000 c01096e3 00000001 080d5540 00000005 00000005
[179151.985767] Call Trace:
[179151.985774]  [<c02146f2>] ? vfs_write+0xa2/0x1a0
[179151.985779]  [<f82f8000>] ? faulty_write+0x0/0x20 [faulty]
[179151.985785]  [<c0214fe2>] ? sys_write+0x42/0x70
[179151.985790]  [<c01096e3>] ? sysenter_do_call+0x12/0x28
[179151.985793] Code: <c7> 05 00 00 00 00 00 00 00 00 5d c3 8d 76 00 8d bc 27 00 00 00 00 
[179151.985816] EIP: [<f82f800a>] faulty_write+0xa/0x20 [faulty] SS:ESP 0068:e19edf64
[179151.985822] CR2: 0000000000000000
[179151.985826] ---[ end trace 9342d36e7d9d6b0f ]---

copy_to_user and copy_from_user

chapter 3

#include <asm/uaccess.h>
This include file declares functions used by kernel code to move data to and
from user space.

static inline unsigned long __must_check copy_from_user(void *to, const void __user *from, unsigned long n)
{
        if (access_ok(VERIFY_READ, from, n))
                n = __copy_from_user(to, from, n);
        else /* security hole - plug it */
                memset(to, 0, n);
        return n;
}

static inline unsigned long __must_check copy_to_user(void __user *to, const void *from, unsigned long n)
{
        if (access_ok(VERIFY_WRITE, to, n))
                n = __copy_to_user(to, from, n);
        return n;
}


#ifdef CONFIG_MMU
extern unsigned long __must_check __copy_from_user(void *to, const void __user *from, unsigned long n);
extern unsigned long __must_check __copy_to_user(void __user *to, const void *from, unsigned long n);
extern unsigned long __must_check __copy_to_user_std(void __user *to, const void *from, unsigned long n);
extern unsigned long __must_check __clear_user(void __user *addr, unsigned long n);
extern unsigned long __must_check __clear_user_std(void __user *addr, unsigned long n);
#else
#define __copy_from_user(to,from,n)     (memcpy(to, (void __force *)from, n), 0)
#define __copy_to_user(to,from,n)       (memcpy((void __force *)to, from, n), 0)
#define __clear_user(addr,n)            (memset((void __force *)addr, 0, n), 0)
#endif
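
The return convention is easy to get backwards: both functions return the number of bytes that could NOT be copied, so 0 means success. A user-space sketch of that convention and of the usual caller pattern; model_access_ok, model_copy_from_user and model_write are hypothetical stand-ins, and the "NULL means bad pointer" rule is only an assumption made so the example can run outside the kernel.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for access_ok(): a pointer is "valid" unless NULL. */
static int model_access_ok(const void *uaddr)
{
        return uaddr != NULL;
}

/* Mirrors the listing: returns the number of bytes NOT copied (0 on success)
 * and zeroes the destination when the source fails the access check. */
static unsigned long model_copy_from_user(void *to, const void *from,
                                          unsigned long n)
{
        if (model_access_ok(from)) {
                memcpy(to, from, n);
                return 0;
        }
        memset(to, 0, n);
        return n;
}

/* Typical caller pattern: any nonzero remainder is reported as -EFAULT. */
static long model_write(char *kbuf, size_t ksize, const char *ubuf, size_t count)
{
        if (count > ksize)
                count = ksize;          /* never overrun the kernel buffer */
        if (model_copy_from_user(kbuf, ubuf, count))
                return -EFAULT;
        return (long)count;
}
```

A driver's write method normally returns the number of bytes it accepted, which is why model_write clamps count first and returns the clamped value.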

`jiq.c`

Error and fix

error

jiq.c:18:26: error: linux/config.h: No such file or directory
jiq.c:122: warning: passing argument 1 of ‘schedule_delayed_work’ from incompatible pointer type
jiq.c:244:46: error: macro "INIT_WORK" passed 3 arguments, but takes just 2
jiq.c:244: error: ‘INIT_WORK’ undeclared (first use in this function)

Fix
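Starting with kernel 2.6.20, the INIT_WORK macro changed: it used to take three arguments and now takes two. The include of linux/config.h is simply dropped, since that header no longer exists.
In struct work_struct, the func member has type work_func_t, defined as typedef void (*work_func_t)(struct work_struct *work);, so the handler must be declared as void XXX(struct work_struct *work). Work that is scheduled with a delay must also be declared as struct delayed_work, because schedule_delayed_work takes a struct delayed_work pointer.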

the example:

#include <linux/workqueue.h>
struct work_struct my_work;
void my_workfunc(struct work_struct *ptr);
INIT_WORK(&my_work, my_workfunc);
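
With the two-argument form the handler no longer receives a data pointer. The diff further down works around this by using the global jiq_data; another common approach, sketched here in user space, is to embed the work item in the data structure and step back to it in the handler. Everything named model_* is a stand-in written for these notes, and model_container_of only imitates the kernel's container_of macro.

```c
#include <stddef.h>

/* User-space stand-ins for the kernel types (model_ prefix marks them). */
struct model_work;
typedef void (*model_work_func_t)(struct model_work *work);

struct model_work {
        model_work_func_t func;
};

/* Same idea as the kernel's container_of(): go from a member to its parent. */
#define model_container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical client structure: the work item sits next to its data. */
struct model_client {
        int calls;
        struct model_work work;
};

/* Two-argument style handler: it only gets the work pointer and
 * recovers its private data from the enclosing structure. */
static void model_handler(struct model_work *work)
{
        struct model_client *c =
                model_container_of(work, struct model_client, work);
        c->calls++;
}

/* Stand-ins for INIT_WORK(&work, func) and for the worker running it. */
static void model_init_work(struct model_work *w, model_work_func_t f)
{
        w->func = f;
}

static void model_run(struct model_work *w)
{
        w->func(w);
}
```

This keeps the handler free of globals, at the cost of requiring the work item to live inside the structure it operates on.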

the API:

#include <linux/workqueue.h>

typedef void (*work_func_t)(struct work_struct *work);

struct work_struct {
        atomic_long_t data;
        struct list_head entry;
        work_func_t func;
#ifdef CONFIG_LOCKDEP
        struct lockdep_map lockdep_map;
#endif
};

struct delayed_work {
        struct work_struct work;
        struct timer_list timer;
};

#define INIT_WORK(_work, _func)                                 \
        do {                                                    \
                __INIT_WORK((_work), (_func), 0);               \
        } while (0)

#define INIT_WORK_ONSTACK(_work, _func)                         \
        do {                                                    \
                __INIT_WORK((_work), (_func), 1);               \
        } while (0)

#define INIT_DELAYED_WORK(_work, _func)                         \
        do {                                                    \
                INIT_WORK(&(_work)->work, (_func));             \
                init_timer(&(_work)->timer);                    \
        } while (0)

#define INIT_DELAYED_WORK_ONSTACK(_work, _func)                 \
        do {                                                    \
                INIT_WORK_ONSTACK(&(_work)->work, (_func));     \
                init_timer_on_stack(&(_work)->timer);           \
        } while (0)

#define INIT_DELAYED_WORK_DEFERRABLE(_work, _func)              \
        do {                                                    \
                INIT_WORK(&(_work)->work, (_func));             \
                init_timer_deferrable(&(_work)->timer);         \
        } while (0)

the diff (lines marked < come from the fixed file, lines marked > from the original)

18c18
< /*#include <linux/config.h>*/
---
> #include <linux/config.h>
56c56
< static struct delayed_work jiq_work;
---
> static struct work_struct jiq_work;
83c83
<       struct clientdata *data = (struct clientdata *) ptr;
---
>       struct clientdata *data = ptr;
114c114
< static void jiq_print_wq(struct work_struct *ptr)
---
> static void jiq_print_wq(void *ptr)
116,117c116
<       /*struct clientdata *data = jiq_data;*/
<       
---
>       struct clientdata *data = (struct clientdata *) ptr;
119c118
<       if (! jiq_print (&jiq_data))
---
>       if (! jiq_print (ptr))
122,123c121,122
<       if (jiq_data.delay)
<               schedule_delayed_work(&jiq_work, jiq_data.delay);
---
>       if (data->delay)
>               schedule_delayed_work(&jiq_work, data->delay);
125c124
<               schedule_work(&jiq_work.work);
---
>               schedule_work(&jiq_work);
141c140
<       schedule_work(&jiq_work.work);
---
>       schedule_work(&jiq_work);
154d152
< 
246,247c244
<       printk(KERN_ALERT "jiq init");
<       INIT_DELAYED_WORK(&jiq_work, jiq_print_wq);
---
>       INIT_WORK(&jiq_work, jiq_print_wq, &jiq_data);
253d249
<       
260d255
<       printk(KERN_ALERT "jiq_cleanup");

Test

#cat /proc/jitimer 
    time  delta preempt   pid cpu command
   221012     0       0  3276   4 cat
   221262   250     256     0   4 swapper

/* fragment of the /proc read method: print one line and set up a one-second timer */
jiq_print(&jiq_data);
init_timer(&jiq_timer);              /* init the timer structure */
jiq_timer.function = jiq_timedout;
jiq_timer.data = (unsigned long)&jiq_data;
jiq_timer.expires = jiffies + HZ;    /* one second */

/* timer callback: runs when the timer expires */
static void jiq_timedout(unsigned long ptr)
{
        jiq_print((void *)ptr);            /* print a line */
        wake_up_interruptible(&jiq_wait);  /* awake the process */
}



#cat /proc/jiqtasklet 
    time  delta preempt   pid cpu command
   231361     0     256    25   7 ksoftirqd/7
   231361     0     256    25   7 ksoftirqd/7
   231361     0     256    25   7 ksoftirqd/7
   231361     0     256    25   7 ksoftirqd/7
   231361     0     256    25   7 ksoftirqd/7
   231361     0     256    25   7 ksoftirqd/7
   231361     0     256    25   7 ksoftirqd/7
   231361     0     256    25   7 ksoftirqd/7
   231361     0     256    25   7 ksoftirqd/7
   231361     0     256    25   7 ksoftirqd/7
   231361     0     256    25   7 ksoftirqd/7
   231361     0     256    25   7 ksoftirqd/7
   231361     0     256    25   7 ksoftirqd/7

        if (jiq_print ((void *) ptr))
                tasklet_schedule (&jiq_tasklet);

#cat /proc/jiqwq
    time  delta preempt   pid cpu command
    69546     0       0    32   5 events/5
    69546     0       0    32   5 events/5
    69546     0       0    32   5 events/5
    69546     0       0    32   5 events/5
    69546     0       0    32   5 events/5
    69546     0       0    32   5 events/5
    69546     0       0    32   5 events/5
    69546     0       0    32   5 events/5
    69546     0       0    32   5 events/5
    69546     0       0    32   5 events/5
    69546     0       0    32   5 events/5
    69546     0       0    32   5 events/5
    69546     0       0    32   5 events/5
    69546     0       0    32   5 events/5
    69546     0       0    32   5 events/5
    69546     0       0    32   5 events/5

        if (jiq_data.delay)
                schedule_delayed_work(&jiq_work, jiq_data.delay);
        else
                schedule_work(&jiq_work.work);

#cat /proc/jiqwqdelay 
    time  delta preempt   pid cpu command
   117793     1       0    33   6 events/6
   117794     1       0    33   6 events/6
   117795     1       0    33   6 events/6
   117796     1       0    33   6 events/6
   117797     1       0    33   6 events/6
   117798     1       0    33   6 events/6
   117799     1       0    33   6 events/6
   117800     1       0    33   6 events/6
   117801     1       0    33   6 events/6
   117802     1       0    33   6 events/6

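Creating your /proc file

chapter 4.3

All modules that work with /proc should include <linux/proc_fs.h> to define the proper functions.

When a process reads from your /proc file, the kernel allocates a page of memory (that is, PAGE_SIZE bytes) where the driver can write data to be returned to user space. That buffer is passed to your function, which is a method called read_proc: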

int (*read_proc)(char *page, char **start, off_t offset, int count, int *eof, void *data);

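The page pointer is the buffer where you write your data; start is used by the function to say where the interesting data has been written in page; offset and count have the same meaning as for the read method. The eof argument points to an integer that the driver must set to signal that it has no more data to return, and data is a driver-specific data pointer you can use for internal bookkeeping.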
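
A minimal sketch of a function with this signature, written so that it can be called from an ordinary user-space program. struct model_dev, MODEL_PAGE_SIZE and the reported fields are invented for illustration; the function handles only the simple case where all output fits in one page, so it ignores start, offset and count.

```c
#include <stdio.h>
#include <sys/types.h>

#define MODEL_PAGE_SIZE 4096   /* assumption: 4 KiB pages */

/* Hypothetical device state reported through the /proc file. */
struct model_dev {
        int quantum;
        int qset;
};

/* Has the read_proc signature shown above. Writes everything in one go,
 * so it ignores start/offset/count, sets *eof and returns the byte count. */
static int model_read_proc(char *page, char **start, off_t offset,
                           int count, int *eof, void *data)
{
        struct model_dev *dev = data;
        int len;

        (void)start; (void)offset; (void)count;
        len = snprintf(page, MODEL_PAGE_SIZE, "quantum %d, qset %d\n",
                       dev->quantum, dev->qset);
        *eof = 1;       /* nothing more to return */
        return len;
}
```

The return value is the number of bytes placed in page. Output larger than one page needs the start/offset protocol, which this sketch does not attempt.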

Once you have a read_proc function defined, you need to connect it to an entry in the /proc hierarchy. This is done with a call to create_proc_read_entry:

struct proc_dir_entry *create_proc_read_entry(const char *name,mode_t mode, struct proc_dir_entry *base, read_proc_t *read_proc, void *data);

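Here, name is the name of the file to create, mode is the protection mask for the file (it can be passed as 0 for a system-wide default), base indicates the directory in which the file should be created (if base is NULL, the file is created in the /proc root), read_proc is the read_proc function that implements the file, and data is ignored by the kernel (but passed to read_proc). This is the call scull uses to make its /proc function available as /proc/scullmem: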

create_proc_read_entry("scullmem", 0 /* default mode */, NULL /* parent dir */, scull_read_procmem, NULL /* client data */);

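Here, we create a file called scullmem directly under /proc, with the default, world-readable protections.

Entries in /proc, of course, should be removed when the module is unloaded. remove_proc_entry is the function that undoes what create_proc_read_entry did: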

remove_proc_entry("scullmem", NULL /* parent dir */);

Failing to remove an entry can result in calls at unwanted times or, if your module has been unloaded, a kernel crash.

Ref

#include <linux/proc_fs.h>
static inline struct proc_dir_entry *create_proc_read_entry(const char *name,
        mode_t mode, struct proc_dir_entry *base, 
        read_proc_t *read_proc, void * data)
{
        struct proc_dir_entry *res=create_proc_entry(name,mode,base);
        if (res) {
                res->read_proc=read_proc;
                res->data=data;
        }
        return res;
}

extern void remove_proc_entry(const char *name, struct proc_dir_entry *parent);

typedef int (read_proc_t)(char *page, char **start, off_t off,
                          int count, int *eof, void *data);
typedef int (write_proc_t)(struct file *file, const char __user *buffer,
                           unsigned long count, void *data);

struct proc_dir_entry {
        unsigned int low_ino;
        unsigned short namelen;
        const char *name;
        mode_t mode;
        nlink_t nlink;
        uid_t uid;
        gid_t gid;
        loff_t size;
        const struct inode_operations *proc_iops;
        /*
         * NULL ->proc_fops means "PDE is going away RSN" or
         * "PDE is just created". In either case, e.g. ->read_proc won't be
         * called because it's too late or too early, respectively.
         *
         * If you're allocating ->proc_fops dynamically, save a pointer
         * somewhere.
         */
        const struct file_operations *proc_fops;
        struct proc_dir_entry *next, *parent, *subdir;
        void *data;
        read_proc_t *read_proc;
        write_proc_t *write_proc;
        atomic_t count;         /* use count */
        int pde_users;  /* number of callers into module in progress */
        spinlock_t pde_unload_lock; /* proc_fops checks and pde_users bumps */
        struct completion *pde_unload_completion;
        struct list_head pde_openers;   /* who did ->open, but not ->release */
};

Using the jiffies Counter

chapter 7.1

Timer interrupts are generated by the system's timing hardware at regular intervals; this interval is programmed at boot time by the kernel according to the value of HZ, which is an architecture-dependent value defined in <linux/param.h> or a subplatform file included by it. Default values in the distributed kernel source range from 50 to 1200 ticks per second on real hardware, down to 24 for software simulators.

Every time a timer interrupt occurs, the value of an internal kernel counter is incremented. The counter is initialized to 0 at system boot, so it represents the number of clock ticks since last boot. The counter is a 64-bit variable (even on 32-bit architectures) and is called jiffies_64. However, driver writers normally access the jiffies variable, an unsigned long that is the same as either jiffies_64 or its least significant bits. Using jiffies is usually preferred because it is faster, and accesses to the 64-bit jiffies_64 value are not necessarily atomic on all architectures.

The counter and the utility functions to read it live in <linux/jiffies.h>, although you'll usually just include <linux/sched.h>, which automatically pulls jiffies.h in.

example

#include <linux/jiffies.h>
unsigned long j, stamp_1, stamp_half, stamp_n;

j = jiffies;                      /* read the current value */
stamp_1    = j + HZ;              /* 1 second in the future */
stamp_half = j + HZ/2;            /* half a second */
stamp_n    = j + n * HZ / 1000;   /* n milliseconds */

To compare your cached value and the current value, you should use one of the following macros:

#include <linux/jiffies.h>
int time_after(unsigned long a, unsigned long b);
int time_before(unsigned long a, unsigned long b);
int time_after_eq(unsigned long a, unsigned long b);
int time_before_eq(unsigned long a, unsigned long b);

/* time_after(a,b) returns true if the time a is after time b. */
#define time_after(a,b)         \
        (typecheck(unsigned long, a) && \
         typecheck(unsigned long, b) && \
         ((long)(b) - (long)(a) < 0))
#define time_before(a,b)        time_after(b,a)

#define time_after_eq(a,b)      \
        (typecheck(unsigned long, a) && \
         typecheck(unsigned long, b) && \
         ((long)(a) - (long)(b) >= 0))
#define time_before_eq(a,b)     time_after_eq(b,a)
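
The reason for these macros is counter wraparound: a plain `>` comparison gives the wrong answer when a deadline is computed just before jiffies overflows. A user-space sketch, with model_* macros that drop the typecheck() guards and do the subtraction on the unsigned values before the cast (a small deviation from the listing, chosen to stay within defined behaviour in standard C). deadline_passed is a hypothetical helper.

```c
#include <limits.h>

/* User-space copies of the comparison macros, minus the typecheck() guards.
 * The difference is taken on unsigned values, then viewed as signed. */
#define model_time_after(a, b)  ((long)((b) - (a)) < 0)
#define model_time_before(a, b) model_time_after(b, a)

/* Nonzero once `now` has passed `deadline`, even across a counter wrap. */
static int deadline_passed(unsigned long now, unsigned long deadline)
{
        return model_time_after(now, deadline);
}
```

With a start value of ULONG_MAX - 10 and a deadline 100 ticks later, the deadline wraps to 89. A naive `start > deadline` test would report the deadline as already passed, while deadline_passed correctly says it has not.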

Sometimes, however, you need to exchange time representations with user space programs that tend to represent time values with struct timeval and struct timespec. The two structures represent a precise time quantity with two numbers: seconds and microseconds are used in the older and popular struct timeval, and seconds and nanoseconds are used in the newer struct timespec. The kernel exports four helper functions to convert time values expressed as jiffies to and from those structures:

#include <linux/time.h>


unsigned long timespec_to_jiffies(struct timespec *value);
void jiffies_to_timespec(unsigned long jiffies, struct timespec *value);
unsigned long timeval_to_jiffies(struct timeval *value);
void jiffies_to_timeval(unsigned long jiffies, struct timeval *value);

<linux/time.h>
 struct timeval {
         __kernel_time_t         tv_sec;         /* seconds */
         __kernel_suseconds_t    tv_usec;        /* microseconds */
 };

 struct timespec {
         __kernel_time_t tv_sec;                 /* seconds */
         long            tv_nsec;                /* nanoseconds */
 };

<kernel/time.c>
/*
 * The TICK_NSEC - 1 rounds up the value to the next resolution.  Note
 * that a remainder subtract here would not do the right thing as the
 * resolution values don't fall on second boundries.  I.e. the line:
 * nsec -= nsec % TICK_NSEC; is NOT a correct resolution rounding.
 *
 * Rather, we just shift the bits off the right.
 *
 * The >> (NSEC_JIFFIE_SC - SEC_JIFFIE_SC) converts the scaled nsec
 * value to a scaled second value.
 */
unsigned long
timespec_to_jiffies(const struct timespec *value)
{
        unsigned long sec = value->tv_sec;
        long nsec = value->tv_nsec + TICK_NSEC - 1;

        if (sec >= MAX_SEC_IN_JIFFIES){
                sec = MAX_SEC_IN_JIFFIES;
                nsec = 0;
        }
        return (((u64)sec * SEC_CONVERSION) +
                (((u64)nsec * NSEC_CONVERSION) >>
                 (NSEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;

}
EXPORT_SYMBOL(timespec_to_jiffies);

void
jiffies_to_timespec(const unsigned long jiffies, struct timespec *value)
{
        /*
         * Convert jiffies to nanoseconds and separate with
         * one divide.
         */
        u32 rem;
        value->tv_sec = div_u64_rem((u64)jiffies * TICK_NSEC,
                                    NSEC_PER_SEC, &rem);
        value->tv_nsec = rem;
}
EXPORT_SYMBOL(jiffies_to_timespec);

/* Same for "timeval"
 *
 * Well, almost.  The problem here is that the real system resolution is
 * in nanoseconds and the value being converted is in micro seconds.
 * Also for some machines (those that use HZ = 1024, in-particular),
 * there is a LARGE error in the tick size in microseconds.

 * The solution we use is to do the rounding AFTER we convert the
 * microsecond part.  Thus the USEC_ROUND, the bits to be shifted off.
 * Instruction wise, this should cost only an additional add with carry
 * instruction above the way it was done above.
 */
unsigned long
timeval_to_jiffies(const struct timeval *value)
{
        unsigned long sec = value->tv_sec;
        long usec = value->tv_usec;

        if (sec >= MAX_SEC_IN_JIFFIES){
                sec = MAX_SEC_IN_JIFFIES;
                usec = 0;
        }
        return (((u64)sec * SEC_CONVERSION) +
                (((u64)usec * USEC_CONVERSION + USEC_ROUND) >>
                 (USEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;
}
EXPORT_SYMBOL(timeval_to_jiffies);

void jiffies_to_timeval(const unsigned long jiffies, struct timeval *value)
{
        /*
         * Convert jiffies to nanoseconds and separate with
         * one divide.
         */
        u32 rem;

        value->tv_sec = div_u64_rem((u64)jiffies * TICK_NSEC,
                                    NSEC_PER_SEC, &rem);
        value->tv_usec = rem / NSEC_PER_USEC;
}
EXPORT_SYMBOL(jiffies_to_timeval);
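
Stripped of the scaled math, the conversions above boil down to "nanoseconds divided by the tick length, rounded up" in one direction and "ticks times the tick length, split into seconds and remainder" in the other. A simplified userspace sketch, assuming HZ = 100 and a tick of exactly 1/HZ seconds; the demo_ names are made up for illustration and the kernel's TICK_NSEC is derived differently:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: HZ = 100 and a tick of exactly 1/HZ seconds. */
#define DEMO_HZ            100ULL
#define DEMO_NSEC_PER_SEC  1000000000ULL
#define DEMO_TICK_NSEC     (DEMO_NSEC_PER_SEC / DEMO_HZ)

struct demo_timespec {
    int64_t tv_sec;   /* seconds */
    long    tv_nsec;  /* nanoseconds, 0..999999999 */
};

/* Round up, so a nonzero interval never converts to zero ticks. */
static uint64_t demo_timespec_to_ticks(const struct demo_timespec *ts)
{
    uint64_t nsec;

    assert(ts->tv_sec >= 0 && ts->tv_nsec >= 0 && ts->tv_nsec < 1000000000L);
    nsec = (uint64_t)ts->tv_sec * DEMO_NSEC_PER_SEC + (uint64_t)ts->tv_nsec;
    return (nsec + DEMO_TICK_NSEC - 1) / DEMO_TICK_NSEC;
}

/* Ticks to nanoseconds, then split into seconds and remainder. */
static void demo_ticks_to_timespec(uint64_t ticks, struct demo_timespec *ts)
{
    uint64_t nsec = ticks * DEMO_TICK_NSEC;

    ts->tv_sec = (int64_t)(nsec / DEMO_NSEC_PER_SEC);
    ts->tv_nsec = (long)(nsec % DEMO_NSEC_PER_SEC);
}
```

Because of the round-up, a round trip through ticks can only lengthen an interval, never shorten it: 1.005 s becomes 101 ticks, which converts back to 1.010 s.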


<linux/jiffies.h>

/*
 * We want to do realistic conversions of time so we need to use the same
 * values the update wall clock code uses as the jiffies size.  This value
 * is: TICK_NSEC (which is defined in timex.h).  This
 * is a constant and is in nanoseconds.  We will use scaled math
 * with a set of scales defined here as SEC_JIFFIE_SC,  USEC_JIFFIE_SC and
 * NSEC_JIFFIE_SC.  Note that these defines contain nothing but
 * constants and so are computed at compile time.  SHIFT_HZ (computed in
 * timex.h) adjusts the scaling for different HZ values.

 * Scaled math???  What is that?
 *
 * Scaled math is a way to do integer math on values that would,
 * otherwise, either overflow, underflow, or cause undesired div
 * instructions to appear in the execution path.  In short, we "scale"
 * up the operands so they take more bits (more precision, less
 * underflow), do the desired operation and then "scale" the result back
 * by the same amount.  If we do the scaling by shifting we avoid the
 * costly mpy and the dastardly div instructions.

 * Suppose, for example, we want to convert from seconds to jiffies
 * where jiffies is defined in nanoseconds as NSEC_PER_JIFFIE.  The
 * simple math is: jiff = (sec * NSEC_PER_SEC) / NSEC_PER_JIFFIE; We
 * observe that (NSEC_PER_SEC / NSEC_PER_JIFFIE) is a constant which we
 * might calculate at compile time, however, the result will only have
 * about 3-4 bits of precision (less for smaller values of HZ).
 *
 * So, we scale as follows:
 * jiff = (sec) * (NSEC_PER_SEC / NSEC_PER_JIFFIE);
 * jiff = ((sec) * ((NSEC_PER_SEC * SCALE)/ NSEC_PER_JIFFIE)) / SCALE;
 * Then we make SCALE a power of two so:
 * jiff = ((sec) * ((NSEC_PER_SEC << SCALE)/ NSEC_PER_JIFFIE)) >> SCALE;
 * Now we define:
 * #define SEC_CONV = ((NSEC_PER_SEC << SCALE)/ NSEC_PER_JIFFIE))
 * jiff = (sec * SEC_CONV) >> SCALE;
 *
 * Often the math we use will expand beyond 32-bits so we tell C how to
 * do this and pass the 64-bit result of the mpy through the ">> SCALE"
 * which should take the result back to 32-bits.  We want this expansion
 * to capture as much precision as possible.  At the same time we don't
 * want to overflow so we pick the SCALE to avoid this.  In this file,
 * that means using a different scale for each range of HZ values (as
 * defined in timex.h).
 *
 * For those who want to know, gcc will give a 64-bit result from a "*"
 * operator if the result is a long long AND at least one of the
 * operands is cast to long long (usually just prior to the "*" so as
 * not to confuse it into thinking it really has a 64-bit operand,
 * which, buy the way, it can do, but it takes more code and at least 2
 * mpys).

 * We also need to be aware that one second in nanoseconds is only a
 * couple of bits away from overflowing a 32-bit word, so we MUST use
 * 64-bits to get the full range time in nanoseconds.

 */

/*
 * Here are the scales we will use.  One for seconds, nanoseconds and
 * microseconds.
 *
 * Within the limits of cpp we do a rough cut at the SEC_JIFFIE_SC and
 * check if the sign bit is set.  If not, we bump the shift count by 1.
 * (Gets an extra bit of precision where we can use it.)
 * We know it is set for HZ = 1024 and HZ = 100 not for 1000.
 * Haven't tested others.

 * Limits of cpp (for #if expressions) only long (no long long), but
 * then we only need the most signicant bit.
 */

#define SEC_JIFFIE_SC (31 - SHIFT_HZ)
#if !((((NSEC_PER_SEC << 2) / TICK_NSEC) << (SEC_JIFFIE_SC - 2)) & 0x80000000)
#undef SEC_JIFFIE_SC
#define SEC_JIFFIE_SC (32 - SHIFT_HZ)
#endif
#define NSEC_JIFFIE_SC (SEC_JIFFIE_SC + 29)
#define USEC_JIFFIE_SC (SEC_JIFFIE_SC + 19)
#define SEC_CONVERSION ((unsigned long)((((u64)NSEC_PER_SEC << SEC_JIFFIE_SC) +\
                                TICK_NSEC -1) / (u64)TICK_NSEC))

#define NSEC_CONVERSION ((unsigned long)((((u64)1 << NSEC_JIFFIE_SC) +\
                                        TICK_NSEC -1) / (u64)TICK_NSEC))
#define USEC_CONVERSION  \
                    ((unsigned long)((((u64)NSEC_PER_USEC << USEC_JIFFIE_SC) +\
                                        TICK_NSEC -1) / (u64)TICK_NSEC))
/*
 * USEC_ROUND is used in the timeval to jiffie conversion.  See there
 * for more details.  It is the scaled resolution rounding value.  Note
 * that it is a 64-bit value.  Since, when it is applied, we are already
 * in jiffies (albit scaled), it is nothing but the bits we will shift
 * off.
 */
#define USEC_ROUND (u64)(((u64)1 << USEC_JIFFIE_SC) - 1)
/*
 * The maximum jiffie value is (MAX_INT >> 1).  Here we translate that
 * into seconds.  The 64-bit case will overflow if we are not careful,
 * so use the messy SH_DIV macro to do it.  Still all constants.
 */
#if BITS_PER_LONG < 64
# define MAX_SEC_IN_JIFFIES \
        (long)((u64)((u64)MAX_JIFFY_OFFSET * TICK_NSEC) / NSEC_PER_SEC)
#else   /* take care of overflow on 64 bits machines */
# define MAX_SEC_IN_JIFFIES \
        (SH_DIV((MAX_JIFFY_OFFSET >> SEC_JIFFIE_SC) * TICK_NSEC, NSEC_PER_SEC, 1) - 1)

#endif
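
The "scaled math" comment above can be tried out in userspace. The sketch below precomputes a conversion constant the same way SEC_CONVERSION does (shift up, divide once, round up) and then converts seconds to ticks with one multiply and one shift. The shift count of 22 and all demo_ names are assumptions chosen for illustration; the kernel derives its shift counts from HZ.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000ULL
#define DEMO_SCALE        22   /* assumed shift count, not the kernel's value */

/* Computed once per tick length: (NSEC_PER_SEC << SCALE) / tick_nsec, rounded up. */
static uint64_t demo_sec_conv(uint64_t tick_nsec)
{
    assert(tick_nsec > 0);
    return ((DEMO_NSEC_PER_SEC << DEMO_SCALE) + tick_nsec - 1) / tick_nsec;
}

/* Scaled conversion: one multiply and one shift, no division. */
static uint64_t demo_sec_to_ticks_scaled(uint64_t sec, uint64_t sec_conv)
{
    return (sec * sec_conv) >> DEMO_SCALE;
}

/* Reference conversion using a 64-bit division, for comparison. */
static uint64_t demo_sec_to_ticks_div(uint64_t sec, uint64_t tick_nsec)
{
    return (sec * DEMO_NSEC_PER_SEC) / tick_nsec;
}
```

With a 10 ms tick the constant is exact and both functions agree. With an awkward tick length such as 976563 ns (roughly HZ = 1024), rounding the constant up means the scaled result is either equal to the divided result or one tick larger, as long as the number of seconds stays below 2^22.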

<linux/time.h>
/* Parameters used to convert the timespec values: */
#define MSEC_PER_SEC    1000L
#define USEC_PER_MSEC   1000L
#define NSEC_PER_USEC   1000L
#define NSEC_PER_MSEC   1000000L
#define USEC_PER_SEC    1000000L
#define NSEC_PER_SEC    1000000000L
#define FSEC_PER_SEC    1000000000000000LL

Ref

<linux/jiffies.h>
/* some arch's have a small-data section that can be accessed register-relative
 * but that can only take up to, say, 4-byte variables. jiffies being part of
 * an 8-byte variable may not be correctly accessed unless we force the issue
 */
#define __jiffy_data  __attribute__((section(".data")))

/*
 * The 64-bit value is not atomic - you MUST NOT read it
 * without sampling the sequence number in xtime_lock.
 * get_jiffies_64() will do this for you as appropriate.
 */
extern u64 __jiffy_data jiffies_64;
extern unsigned long volatile __jiffy_data jiffies;

#if (BITS_PER_LONG < 64)
u64 get_jiffies_64(void);
#else
static inline u64 get_jiffies_64(void)
{
        return (u64)jiffies;
}
#endif

#include <asm/param.h>                  /* for HZ */

#ifdef __KERNEL__
# define HZ             CONFIG_HZ       /* Internal kernel timer frequency */
# define USER_HZ        100             /* User interfaces are in "ticks" */
# define CLOCKS_PER_SEC (USER_HZ)       /* like times() */
#else
# define HZ             100
#endif

In dm816x_defconfig

CONFIG_HZ=100

Sleeping

chapter 6.2

The first of these rules is: never sleep when you are running in an atomic context. An atomic context is simply a state where multiple steps must be performed without any sort of concurrent access. What that means, with regard to sleeping, is that your driver cannot sleep while holding a spinlock, seqlock, or RCU lock. You also cannot sleep if you have disabled interrupts. It is legal to sleep while holding a semaphore, but you should look very carefully at any code that does so.
Another thing to remember with sleeping is that, when you wake up, you never know how long your process may have been out of the CPU or what may have changed in the meantime. You also do not usually know if another process may have been sleeping for the same event; that process may wake before you and grab whatever resource you were waiting for.
One other relevant point, of course, is that your process cannot sleep unless it is assured that somebody else, somewhere, will wake it up.

In Linux, a wait queue is managed by means of a "wait queue head," a structure of type wait_queue_head_t, which is defined in <linux/wait.h>. A wait queue head can be defined and initialized statically with:

DECLARE_WAIT_QUEUE_HEAD(name);

or dynamically as follows:

wait_queue_head_t my_queue;
init_waitqueue_head(&my_queue);

API

#include <linux/wait.h>
typedef struct { /* ... */ } wait_queue_head_t;
void init_waitqueue_head(wait_queue_head_t *queue);

DECLARE_WAIT_QUEUE_HEAD(queue);
    The defined type for Linux wait queues. A wait_queue_head_t must
    be explicitly initialized with either init_waitqueue_head at
    runtime or DECLARE_WAIT_QUEUE_HEAD at compile time.

void wait_event(wait_queue_head_t q, int condition);
int wait_event_interruptible(wait_queue_head_t q, int condition);
int wait_event_timeout(wait_queue_head_t q, int condition, int time);
int wait_event_interruptible_timeout(wait_queue_head_t q, int condition, int time);
    Cause the process to sleep on the given queue until the given condition evaluates to a true value.

void wake_up(wait_queue_head_t *q);
void wake_up_interruptible(wait_queue_head_t *q);
void wake_up_nr(wait_queue_head_t *q, int nr);
void wake_up_interruptible_nr(wait_queue_head_t *q, int nr);
void wake_up_all(wait_queue_head_t *q);
void wake_up_interruptible_all(wait_queue_head_t *q);
void wake_up_interruptible_sync(wait_queue_head_t *q);
    Wake processes that are sleeping on the queue q. The _interruptible form wakes only interruptible processes. Normally, only one exclusive waiter is awakened, but that behavior can be changed with the _nr or _all forms. The _sync version does not reschedule the CPU before returning.

#include <linux/sched.h>
set_current_state(int state);
    Sets the execution state of the current process. TASK_RUNNING means it is ready to run, while the sleep states are TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE.

void schedule(void);
    Selects a runnable process from the run queue. The chosen process can be current or a different one.

typedef struct { /* ... */ } wait_queue_t;
init_waitqueue_entry(wait_queue_t *entry, struct task_struct *task);
    The wait_queue_t type is used to place a process onto a wait queue.

void prepare_to_wait(wait_queue_head_t *queue, wait_queue_t *wait, int state);
void prepare_to_wait_exclusive(wait_queue_head_t *queue, wait_queue_t *wait, int state);
void finish_wait(wait_queue_head_t *queue, wait_queue_t *wait);
    Helper functions that can be used to code a manual sleep.

void sleep_on(wait_queue_head_t *queue);
void interruptible_sleep_on(wait_queue_head_t *queue);
    Obsolete and deprecated functions that unconditionally put the current process to sleep.

Ref

typedef struct __wait_queue wait_queue_t;
typedef int (*wait_queue_func_t)(wait_queue_t *wait, unsigned mode, int flags, void *key);
int default_wake_function(wait_queue_t *wait, unsigned mode, int flags, void *key);

struct __wait_queue {
        unsigned int flags;
#define WQ_FLAG_EXCLUSIVE       0x01
        void *private;
        wait_queue_func_t func;
        struct list_head task_list;
};

struct __wait_queue_head {
        spinlock_t lock;
        struct list_head task_list;
};
typedef struct __wait_queue_head wait_queue_head_t;
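
The WQ_FLAG_EXCLUSIVE flag is what makes wake_up, wake_up_nr and wake_up_all behave differently. A simplified userspace model of the wake-up walk, assuming the queue is just an array in queue order; the demo_ names are hypothetical and no real scheduling happens:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_WQ_FLAG_EXCLUSIVE 0x01

struct demo_waiter {
    unsigned int flags;  /* DEMO_WQ_FLAG_EXCLUSIVE or 0 */
    bool awake;
};

/*
 * Walk the queue in order and wake sleeping waiters. Non-exclusive waiters
 * are always woken; the walk stops once nr_exclusive exclusive waiters have
 * been woken. nr_exclusive == 0 never reaches the stop condition, which
 * models the _all variants. Returns the number of waiters woken.
 */
static size_t demo_wake_up(struct demo_waiter *q, size_t n, int nr_exclusive)
{
    size_t woken = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (q[i].awake)
            continue;
        q[i].awake = true;
        woken++;
        if ((q[i].flags & DEMO_WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
            break;
    }
    return woken;
}
```

In this model a plain wake_up corresponds to nr_exclusive = 1: every non-exclusive waiter ahead of the first exclusive one is woken, plus that one exclusive waiter, and everything behind it keeps sleeping.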

Tasklets

chapter 7.5

A tasklet exists as a data structure that must be initialized before use. Initialization can be performed by calling a specific function or by declaring the structure using certain macros:

#include <linux/interrupt.h>

struct tasklet_struct {
      /* ... */
      void (*func)(unsigned long);
      unsigned long data;
};

void tasklet_init(struct tasklet_struct *t, void (*func)(unsigned long), unsigned long data);
DECLARE_TASKLET(name, func, data);
DECLARE_TASKLET_DISABLED(name, func, data);

Tasklets offer a number of interesting features:

A tasklet can be disabled and re-enabled later; it won't be executed until it is enabled as many times as it has been disabled.
Just like timers, a tasklet can reregister itself.
A tasklet can be scheduled to execute at normal priority or high priority. The latter group is always executed first.
Tasklets may be run immediately if the system is not under heavy load but never later than the next timer tick.
A tasklet can run concurrently with other tasklets but is strictly serialized with respect to itself: the same tasklet never runs simultaneously on more than one processor. Also, as already noted, a tasklet always runs on the same CPU that schedules it.
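
The disable/enable rule in the first item is just a counter. A minimal userspace model, assuming a single tasklet and a hand-driven "softirq" pass; every demo_ name is hypothetical and nothing here touches real kernel state:

```c
#include <assert.h>

struct demo_tasklet {
    int count;                    /* 0 = enabled, >0 = disabled that many times */
    int scheduled;                /* nonzero while waiting to run */
    void (*func)(unsigned long);
    unsigned long data;
};

static unsigned long demo_hits;

/* Example tasklet body: accumulate the data argument. */
static void demo_count_hit(unsigned long data)
{
    demo_hits += data;
}

static void demo_tasklet_disable(struct demo_tasklet *t)
{
    t->count++;
}

static void demo_tasklet_enable(struct demo_tasklet *t)
{
    assert(t->count > 0);   /* each enable must match an earlier disable */
    t->count--;
}

static void demo_tasklet_schedule(struct demo_tasklet *t)
{
    t->scheduled = 1;
}

/* One pass of the pretend softirq: returns 1 if the tasklet body ran. */
static int demo_tasklet_run(struct demo_tasklet *t)
{
    if (!t->scheduled || t->count != 0)
        return 0;           /* not scheduled, or still disabled: stays pending */
    t->scheduled = 0;
    t->func(t->data);
    return 1;
}
```

A tasklet that was disabled twice and then scheduled stays pending through the first enable and only runs after the second one.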

API

#include <linux/interrupt.h>
DECLARE_TASKLET(name, func, data);
DECLARE_TASKLET_DISABLED(name, func, data);
void tasklet_init(struct tasklet_struct *t, void (*func)(unsigned long), unsigned long data);
    The first two macros declare a tasklet structure, while the
tasklet_init function initializes a tasklet structure that has been
obtained by allocation or other means. The second DECLARE macro marks
the tasklet as disabled.

void tasklet_disable(struct tasklet_struct *t);
void tasklet_disable_nosync(struct tasklet_struct *t);
void tasklet_enable(struct tasklet_struct *t);
    Disables and reenables a tasklet. Each disable must be matched
with an enable (you can disable the tasklet even if it's already
disabled). The function tasklet_disable waits for the tasklet to
terminate if it is running on another CPU. The nosync version doesn't
take this extra step.

void tasklet_schedule(struct tasklet_struct *t);
void tasklet_hi_schedule(struct tasklet_struct *t);
    Schedules a tasklet to run, either as a "normal" tasklet or a
high-priority one. When soft interrupts are executed, high-priority
tasklets are dealt with first, while normal tasklets run last.

void tasklet_kill(struct tasklet_struct *t);
    Removes the tasklet from the list of active ones, if it's
    scheduled to run. Like tasklet_disable, the function may block on
    SMP systems waiting for the tasklet to terminate if it's currently
    running on another CPU.

Ref

#include <linux/interrupt.h>
/* Tasklets --- multithreaded analogue of BHs.

   Main feature differing them of generic softirqs: tasklet
   is running only on one CPU simultaneously.

   Main feature differing them of BHs: different tasklets
   may be run simultaneously on different CPUs.

   Properties:
   * If tasklet_schedule() is called, then tasklet is guaranteed
     to be executed on some cpu at least once after this.
   * If the tasklet is already scheduled, but its excecution is still not
     started, it will be executed only once.
   * If this tasklet is already running on another CPU (or schedule is called
     from tasklet itself), it is rescheduled for later.
   * Tasklet is strictly serialized wrt itself, but not
     wrt another tasklets. If client needs some intertask synchronization,
     he makes it with spinlocks.
 */

struct tasklet_struct
{
        struct tasklet_struct *next;
        unsigned long state;
        atomic_t count;
        void (*func)(unsigned long);
        unsigned long data;
};

#define DECLARE_TASKLET(name, func, data) \
struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(0), func, data }

#define DECLARE_TASKLET_DISABLED(name, func, data) \
struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(1), func, data }

Workqueues

chapter 7.6

The key difference between tasklets and workqueues is that tasklets execute quickly, for a short period of time, and in atomic mode, while workqueue functions may have higher latency but need not be atomic. Each mechanism has situations where it is appropriate.

Normal queue

create a workqueue

Workqueues have a type of struct workqueue_struct, which is defined in <linux/workqueue.h>. A workqueue must be explicitly created before use, using one of the following two functions:

struct workqueue_struct *create_workqueue(const char *name);
struct workqueue_struct *create_singlethread_workqueue(const char *name);

submit a task to a workqueue
To submit a task to a workqueue, you need to fill in a work_struct structure. This can be done at compile time as follows:

DECLARE_WORK(name, void (*function)(struct work_struct *));

If you need to set up the work_struct structure at runtime, use the following two macros:

INIT_WORK(struct work_struct *work, void (*function)(struct work_struct *));
PREPARE_WORK(struct work_struct *work, void (*function)(struct work_struct *));

The book describes the older three-argument forms with a separate void *data parameter; since kernel 2.6.20 the work function receives a pointer to the work_struct itself (see work_func_t in the Ref below).

There are two functions for submitting work to a workqueue:

int queue_work(struct workqueue_struct *queue, struct work_struct *work);
int queue_delayed_work(struct workqueue_struct *queue, struct delayed_work *work, unsigned long delay);

cancel a pending workqueue entry

Should you need to cancel a pending workqueue entry, you may call:
int cancel_delayed_work(struct delayed_work *work);

To be absolutely sure that the work function is not running anywhere in the system after cancel_delayed_work returns 0, you must follow that call with a call to:
void flush_workqueue(struct workqueue_struct *queue);

destroy a workqueue

void destroy_workqueue(struct workqueue_struct *queue);

Shared Queue
If you only submit tasks to the queue occasionally, it may be more efficient to simply use the shared, default workqueue that is provided by the kernel. If you use this queue, however, you must be aware that you will be sharing it with others.
```
prepare_to_wait(&jiq_wait, &wait, TASK_INTERRUPTIBLE);
schedule_work(&jiq_work);
schedule(  );
finish_wait(&jiq_wait, &wait);
```

API

#include <linux/workqueue.h>

struct workqueue_struct;
struct work_struct;
    The structures representing a workqueue and a work entry, respectively.

struct workqueue_struct *create_workqueue(const char *name);
struct workqueue_struct *create_singlethread_workqueue(const char *name);
void destroy_workqueue(struct workqueue_struct *queue);
    Functions for creating and destroying workqueues. A call to create_workqueue creates a queue with a worker thread on each processor in the system; instead, create_singlethread_workqueue creates a workqueue with a single worker process.

DECLARE_WORK(name, void (*function)(struct work_struct *));
INIT_WORK(struct work_struct *work, void (*function)(struct work_struct *));
PREPARE_WORK(struct work_struct *work, void (*function)(struct work_struct *));
    Macros that declare and initialize workqueue entries.

int queue_work(struct workqueue_struct *queue, struct work_struct *work);
int queue_delayed_work(struct workqueue_struct *queue, struct delayed_work *work, unsigned long delay);
    Functions that queue work for execution from a workqueue.

bool cancel_delayed_work_sync(struct delayed_work *dwork);
void flush_workqueue(struct workqueue_struct *queue);
    Use cancel_delayed_work_sync to cancel a pending delayed entry, waiting for it to finish if it is already running; after flush_workqueue returns, no work function submitted prior to the call is running anywhere in the system.

int schedule_work(struct work_struct *work);
int schedule_delayed_work(struct delayed_work *work, unsigned long delay);
void flush_scheduled_work(void);
    Functions for working with the shared workqueue.

Ref

#include <linux/workqueue.h>

typedef void (*work_func_t)(struct work_struct *work);

struct work_struct {
        atomic_long_t data;
        struct list_head entry;
        work_func_t func;
#ifdef CONFIG_LOCKDEP
        struct lockdep_map lockdep_map;
#endif
};

struct delayed_work {
        struct work_struct work;
        struct timer_list timer;
};

#define create_workqueue(name)                                  \
        alloc_workqueue((name), WQ_MEM_RECLAIM, 1)
#define create_freezeable_workqueue(name)                       \
        alloc_workqueue((name), WQ_FREEZEABLE | WQ_UNBOUND | WQ_MEM_RECLAIM, 1)
#define create_singlethread_workqueue(name)                     \
        alloc_workqueue((name), WQ_UNBOUND | WQ_MEM_RECLAIM, 1)

extern void destroy_workqueue(struct workqueue_struct *wq);
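
Since work_func_t only receives a pointer to the work_struct, a driver usually embeds the work_struct in its own state structure and recovers the outer structure from the member pointer, which is what the kernel's container_of macro does. A userspace sketch of that pattern, assuming made-up demo_ types in place of the real work_struct and a direct call in place of the worker thread:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace re-creation of the container_of() idea. */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_work;
typedef void (*demo_work_func_t)(struct demo_work *work);

struct demo_work {
    demo_work_func_t func;
};

/* Hypothetical driver state with the work item embedded in it. */
struct demo_device {
    int irq_count;
    struct demo_work work;
};

/* The handler only receives the work pointer and recovers its device. */
static void demo_handler(struct demo_work *work)
{
    struct demo_device *dev =
        demo_container_of(work, struct demo_device, work);

    dev->irq_count++;
}

/* Stand-in for the worker thread: call the queued function. */
static void demo_run_work(struct demo_work *work)
{
    assert(work->func != NULL);
    work->func(work);
}
```

This is why the two-argument INIT_WORK needs no separate data pointer: the address of the work item already identifies the structure it lives in.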

Kernel Timers

chapter 7.4

A kernel timer is a data structure that instructs the kernel to execute a user-defined function with a user-defined argument at a user-defined time. The implementation resides in <linux/timer.h> and kernel/timer.c.

In fact, kernel timers are run as the result of a "software interrupt." When running in this sort of atomic context, your code is subject to a number of constraints. Timer functions must be atomic in every respect, and the lack of a process context brings some additional restrictions.

A number of actions require the context of a process in order to be executed. When you are outside of process context (i.e., in interrupt context), you must observe the following rules:

No access to user space is allowed. Because there is no process context, there is no path to the user space associated with any particular process.
The current pointer is not meaningful in atomic mode and cannot be used since the relevant code has no connection with the process that has been interrupted.
No sleeping or scheduling may be performed. Atomic code may not call schedule or a form of wait_event, nor may it call any other function that could sleep. For example, calling kmalloc(…, GFP_KERNEL) is against the rules. Semaphores also must not be used since they can sleep.

Kernel code can tell if it is running in interrupt context by calling the function in_interrupt( ), which takes no parameters and returns nonzero if the processor is currently running in interrupt context, either hardware interrupt or software interrupt.

API

#include <asm/hardirq.h>
int in_interrupt(void);
int in_atomic(void);
    Returns a Boolean value telling whether the calling code is executing in interrupt context or atomic context. Interrupt context is outside of a process context, either during hardware or software interrupt processing. Atomic context is any context in which you cannot schedule: either interrupt context or process context with a spinlock held.

#include <linux/timer.h>
struct timer_list {
        /* ... */
        unsigned long expires;
        void (*function)(unsigned long);
        unsigned long data;
};
The expires field represents the jiffies value when the timer is
        expected to run; at that time, the function function is called
        with data as an argument.


void init_timer(struct timer_list * timer);
struct timer_list TIMER_INITIALIZER(_function, _expires, _data);
    This function and the static declaration of the timer structure are the two ways to initialize a timer_list data structure.

void add_timer(struct timer_list * timer);
    Registers the timer structure to run on the current CPU.

int mod_timer(struct timer_list *timer, unsigned long expires);
    Changes the expiration time of an already scheduled timer structure. It can also act as an alternative to add_timer.

int timer_pending(struct timer_list * timer);
Macro that returns a Boolean value stating whether the timer structure is already registered to run.

int del_timer(struct timer_list * timer);
int del_timer_sync(struct timer_list * timer);
    Removes a timer from the list of active timers. The latter function ensures that the timer is not currently running on another CPU.

Ref

struct timer_list {
        /*
         * All fields that change during normal runtime grouped to the
         * same cacheline
         */
        struct list_head entry;
        unsigned long expires;
        struct tvec_base *base;

        void (*function)(unsigned long);
        unsigned long data;

        int slack;

#ifdef CONFIG_TIMER_STATS
        void *start_site;
        char start_comm[16];
        int start_pid;
#endif
#ifdef CONFIG_LOCKDEP
        struct lockdep_map lockdep_map;
#endif
};

`kdataalign.c`

error and fix

error

error: ‘system_utsname’ undeclared (first use in this function)

#define system_utsname init_uts_ns.name

Ref:

<linux/utsname.h>
extern struct uts_namespace init_uts_ns;

struct uts_namespace {
        struct kref kref;
        struct new_utsname name;
};

struct new_utsname {
        char sysname[__NEW_UTS_LEN + 1];
        char nodename[__NEW_UTS_LEN + 1];
        char release[__NEW_UTS_LEN + 1];
        char version[__NEW_UTS_LEN + 1];
        char machine[__NEW_UTS_LEN + 1];
        char domainname[__NEW_UTS_LEN + 1];
};

Test

align of char

struct c   {char c;  char      t;} c;
struct s   {char c;  short     t;} s;
struct i   {char c;  int       t;} i;
struct l   {char c;  long      t;} l;
struct ll  {char c;  long long t;} ll;
struct p   {char c;  void *    t;} p;
struct u1b {char c;  __u8      t;} u1b;
struct u2b {char c;  __u16     t;} u2b;
struct u4b {char c;  __u32     t;} u4b;
struct u8b {char c;  __u64     t;} u8b;

(int)((void *)(&c.t)   - (void *)&c),

Result

[19232.053968] arch  Align:  char  short  int  long   ptr long-long  u8 u16 u32 u64
[19232.053972] i686            1     2     4     4     4     4        1   2   4   4

align of other type

struct c2   {char c;  char      t;} c2;
struct s2   {short c;  char     t;} s2;
struct i2   {int  c;  char       t;} i2;
struct l2   {long c;  char       t;} l2;
struct ll2  {long long c;  char  t;} ll2;
struct p2   {void * c;  char  t;} p2;
struct u1b2 {__u8 c;  char   t;} u1b2;
struct u2b2 {__u16 c; char   t;} u2b2;
struct u4b2 {__u32 c; char   t;} u4b2;
struct u8b2 {__u64 c; char   t;} u8b2;
        (int)((void *)(&c2.t)   - (void *)&c2),

Result

[20443.346930] arch  Align:  char  short  int  long   ptr long-long  u8 u16 u32 u64
[20443.346933] i686            1     2     4     4     4     8        1   2   4   8

compare

[20443.346924] arch  Align:  char  short  int  long   ptr long-long  u8 u16 u32 u64
[20443.346929] i686  char      1     2     4     4     4     4        1   2   4   4
[20443.346930] arch  Align:  char  short  int  long   ptr long-long  u8 u16 u32 u64
[20443.346933] i686  Other     1     2     4     4     4     8        1   2   4   8
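
The two tables differ only in the long-long and u64 columns because the two layouts measure different things. With the char first, the offset of t is the alignment the compiler gives the second member (4 for long long on i686). With the char last, the offset of t is simply the size of the first member (8 for long long). A userspace sketch using offsetof, assuming just two representative types; the struct and function names are made up here:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* "char first": the offset of t shows the alignment of its type. */
struct char_then_int { char c; int t; };
struct char_then_ll  { char c; long long t; };

/* "char last": the offset of t is the size of the first member. */
struct int_then_char { int c; char t; };
struct ll_then_char  { long long c; char t; };

/* Print both measurements; the values depend on the target ABI. */
static void demo_print_layout(void)
{
    /* A char has alignment 1, so it always sits right after the first member. */
    assert(offsetof(struct ll_then_char, t) == sizeof(long long));

    printf("char first: int %zu, long long %zu\n",
           offsetof(struct char_then_int, t),
           offsetof(struct char_then_ll, t));
    printf("char last:  int %zu, long long %zu\n",
           offsetof(struct int_then_char, t),
           offsetof(struct ll_then_char, t));
}
```

offsetof gives the same number as the pointer subtraction used in the module, without needing an instance of each structure.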

`kdatasize.c`

(int)sizeof(char), (int)sizeof(short), (int)sizeof(int),
                (int)sizeof(long),
                (int)sizeof(void *), (int)sizeof(long long), (int)sizeof(__u8),
                (int)sizeof(__u16), (int)sizeof(__u32), (int)sizeof(__u64));

[19790.346939] arch   Size:  char  short  int  long   ptr long-long  u8 u16 u32 u64
[19790.346946] i686            1     2     4     4     4     8        1   2   4   8

`sleepy.c`

static DECLARE_WAIT_QUEUE_HEAD(wq);

wait_event_interruptible(wq, flag != 0);

wake_up_interruptible(&wq);

Test

# sudo insmod sleepy.ko
# sudo mknod -m og+rw  /dev/sleepy c  XXX 0
# cat /dev/sleepy
# echo "X" > /dev/sleepy

[11010.693929] process 4770 (cat) going to sleep
[11063.180845] process 4772 (zsh) awakening the readers...
[11063.180861] awoken 4770 (cat)

`jit.c`

Error and Fix

jit.c: In function ‘jit_fn’:
jit.c:73: error: implicit declaration of function ‘schedule’
jit.c:77: error: ‘TASK_INTERRUPTIBLE’ undeclared (first use in this function)
jit.c:77: error: (Each undeclared identifier is reported only once
jit.c:77: error: for each function it appears in.)
jit.c:77: error: implicit declaration of function ‘signal_pending’
jit.c:77: error: implicit declaration of function ‘schedule_timeout’
jit.c:80: error: implicit declaration of function ‘set_current_state’

#include <linux/sched.h>

tasklet_hi_schedule

chapter 7.5

API

void tasklet_hi_schedule(struct tasklet_struct *t);

Schedule the tasklet for execution with higher priority. When the soft interrupt handler runs, it deals with high-priority tasklets before other soft interrupt tasks, including “normal” tasklets. Ideally, only tasks with low-latency requirements (such as filling the audio buffer) should use this function, to avoid the additional latencies introduced by other soft interrupt handlers. Actually, /proc/jitasklethi shows no human-visible difference from /proc/jitasklet.
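
The ordering rule can be pictured as two queues that are drained in a fixed order. A toy userspace model, assuming tasklets are identified by plain integers and that both queues are drained in one pass; the demo_ names are hypothetical and other softirq types are left out:

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_PENDING 8

struct demo_queue {
    int ids[DEMO_MAX_PENDING];
    size_t len;
};

static struct demo_queue demo_hi, demo_normal;

static void demo_push(struct demo_queue *q, int id)
{
    assert(q->len < DEMO_MAX_PENDING);
    q->ids[q->len++] = id;
}

/* Models tasklet_hi_schedule(): queue on the high-priority list. */
static void demo_schedule_hi(int id)
{
    demo_push(&demo_hi, id);
}

/* Models tasklet_schedule(): queue on the normal list. */
static void demo_schedule_normal(int id)
{
    demo_push(&demo_normal, id);
}

/* Drain the high-priority queue first, then the normal one.
 * Writes the run order to out and returns how many entries ran. */
static size_t demo_run_pending(int *out, size_t cap)
{
    struct demo_queue *order[2] = { &demo_hi, &demo_normal };
    size_t n = 0;
    size_t q, i;

    assert(cap >= demo_hi.len + demo_normal.len);
    for (q = 0; q < 2; q++) {
        for (i = 0; i < order[q]->len; i++)
            out[n++] = order[q]->ids[i];
        order[q]->len = 0;
    }
    return n;
}
```

Scheduling order is preserved within each queue, but anything on the high-priority queue runs before everything on the normal one, regardless of when it was scheduled.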

Test

/proc/currentime

0x00a7d291 0x0000000100a7d291 1350830993.423880
                              1350830993.423499790
0x00a7d291 0x0000000100a7d291 1350830993.423881
                              1350830993.423499790
0x00a7d291 0x0000000100a7d291 1350830993.423882
                              1350830993.423499790
0x00a7d291 0x0000000100a7d291 1350830993.423883
                              1350830993.423499790
0x00a7d291 0x0000000100a7d291 1350830993.423885
                              1350830993.423499790
0x00a7d291 0x0000000100a7d291 1350830993.423886
                              1350830993.423499790
0x00a7d291 0x0000000100a7d291 1350830993.423887
                              1350830993.423499790
0x00a7d291 0x0000000100a7d291 1350830993.423888
                              1350830993.423499790
0x00a7d291 0x0000000100a7d291 1350830993.423890
                              1350830993.423499790
0x00a7d291 0x0000000100a7d291 1350830993.423891
                              1350830993.423499790

struct timeval tv1;
struct timespec tv2;
unsigned long j1;
u64 j2;

/* get them four */
j1 = jiffies;
j2 = get_jiffies_64();
do_gettimeofday(&tv1);
tv2 = current_kernel_time();

/* print */
len=0;
len += sprintf(buf,"0x%08lx 0x%016Lx %10i.%06i\n"
               "%40i.%09i\n",
               j1, j2,
               (int) tv1.tv_sec, (int) tv1.tv_usec,
               (int) tv2.tv_sec, (int) tv2.tv_nsec);

/proc/jitbusy chapter 7.3

# dd bs=20 count=5 < /proc/jitbusy
 11160469  11160719
 11160719  11160969
 11160969  11161219
 11161219  11161469
 11161469  11161719

j0 = jiffies;
j1 = j0 + delay;

while (time_before(jiffies, j1))
  cpu_relax();

/proc/jitsched

# dd bs=20 count=5 < /proc/jitsched
 20901444  20901694
 20901694  20901944
 20901944  20902194
 20902194  20902444
 20902444  20902694

j0 = jiffies;
j1 = j0 + delay;

while (time_before(jiffies, j1)) {
  schedule();
}

/proc/jitqueue

# dd bs=20 count=5 < /proc/jitqueue
 20928572  20928822
 20928822  20929072
 20929072  20929322
 20929322  20929572
 20929572  20929822

wait_event_interruptible_timeout(wait, 0, delay);

/proc/jitschedto

# dd bs=20 count=5 < /proc/jitschedto
20955019  20955269
20955269  20955519
20955519  20955769
20955769  20956019
20956019  20956269

set_current_state(TASK_INTERRUPTIBLE);
schedule_timeout (delay);

/proc/jitimer

# cat /proc/jitimer 
   time   delta  inirq    pid   cpu command
 20978101    0     0     19300   5   cat
 20978111   10     1         0   5   swapper
 20978121   10     1         0   5   swapper
 20978131   10     1         0   5   swapper
 20978141   10     1         0   5   swapper
 20978151   10     1         0   5   swapper

init_timer(&data->timer);
init_waitqueue_head (&data->wait);
data->loops = JIT_ASYNC_LOOPS;

/* register the timer */
data->timer.data = (unsigned long)data;
data->timer.function = jit_timer_fn;
data->timer.expires = j + tdelay; /* parameter */
add_timer(&data->timer);

/* wait for the buffer to fill */
wait_event_interruptible(data->wait, !data->loops);

/proc/jitasklet

# cat /proc/jitasklet
   time   delta  inirq    pid   cpu command
 21042512    0     0     19332   6   cat
 21042512    0     1        22   6   ksoftirqd/6
 21042512    0     1        22   6   ksoftirqd/6
 21042512    0     1        22   6   ksoftirqd/6
 21042512    0     1        22   6   ksoftirqd/6
 21042512    0     1        22   6   ksoftirqd/6

/* register the tasklet */
tasklet_init(&data->tlet, jit_tasklet_fn, (unsigned long)data);
data->hi = hi;
if (hi)
  tasklet_hi_schedule(&data->tlet);
else
  tasklet_schedule(&data->tlet);

/* wait for the buffer to fill */
wait_event_interruptible(data->wait, !data->loops);

/proc/jitasklethi

# cat /proc/jitasklethi 
   time   delta  inirq    pid   cpu command
 21047810    0     0     19340   6   cat
 21047810    0     1        22   6   ksoftirqd/6
 21047810    0     1        22   6   ksoftirqd/6
 21047810    0     1        22   6   ksoftirqd/6
 21047810    0     1        22   6   ksoftirqd/6
 21047810    0     1        22   6   ksoftirqd/6

/* register the tasklet */
tasklet_init(&data->tlet, jit_tasklet_fn, (unsigned long)data);
data->hi = hi;
if (hi)
  tasklet_hi_schedule(&data->tlet);
else
  tasklet_schedule(&data->tlet);

/* wait for the buffer to fill */
wait_event_interruptible(data->wait, !data->loops);

`seq.c`

seq_file

chapter 4.3

The seq_file interface assumes that you are creating a virtual file that steps through a sequence of items that must be returned to user space. To use seq_file, you must create a simple "iterator" object that can establish a position within the sequence, step forward, and output one item in the sequence. It may sound complicated, but, in fact, the process is quite simple.

The first step, inevitably, is the inclusion of <linux/seq_file.h>. Then you must create four iterator methods, called start, next, stop, and show.

void *start(struct seq_file *sfile, loff_t *pos);

The sfile argument can almost always be ignored. pos is an integer position indicating where the reading should start. The interpretation of the position is entirely up to the implementation; it need not be a byte position in the resulting file.

The next function should move the iterator to the next position, returning NULL if there is nothing left in the sequence. This method's prototype is:

void *next(struct seq_file *sfile, void *v, loff_t *pos);

Here, v is the iterator as returned from the previous call to start or next, and pos is the current position in the file. next should increment the value pointed to by pos; depending on how your iterator works, you might (though probably won't) want to increment pos by more than one.

When the kernel is done with the iterator, it calls stop to clean up:

void stop(struct seq_file *sfile, void *v);

In between these calls, the kernel calls the show method to actually output something interesting to user space. This method's prototype is:

int show(struct seq_file *sfile, void *v);
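The calling order of the four methods can be illustrated with a small userspace sketch. Everything here is a hypothetical stand-in: struct fake_seq replaces struct seq_file, long long replaces loff_t, the demo_* functions mimic a counting iterator, and demo_read_all is a much simplified version of the loop a reader drives. The real kernel read path also handles buffer sizing and resuming across several read calls, which this sketch leaves out.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_LIMIT 5    /* assumed sequence length for this demo */

/* Hypothetical stand-in for struct seq_file: just an output buffer. */
struct fake_seq {
    char buf[256];
    size_t len;
};

/* start: create an iterator for position *pos, or NULL past the end. */
static void *demo_start(struct fake_seq *s, long long *pos)
{
    long long *it;

    (void)s;
    if (*pos >= DEMO_LIMIT)
        return NULL;
    it = malloc(sizeof(*it));
    if (it)
        *it = *pos;
    return it;
}

/* next: advance the iterator and *pos together; NULL ends the sequence. */
static void *demo_next(struct fake_seq *s, void *v, long long *pos)
{
    long long *it = v;

    (void)s;
    *pos = ++(*it);
    if (*it >= DEMO_LIMIT) {
        free(it);       /* end of sequence: release the iterator here */
        return NULL;
    }
    return it;
}

/* stop: clean up; v is NULL when next has already ended the sequence. */
static void demo_stop(struct fake_seq *s, void *v)
{
    (void)s;
    free(v);
}

/* show: emit one item; returns nonzero when the buffer is full. */
static int demo_show(struct fake_seq *s, void *v)
{
    long long *it = v;
    size_t room = sizeof(s->buf) - s->len;
    int n = snprintf(s->buf + s->len, room, "%lld\n", *it);

    if (n < 0 || (size_t)n >= room)
        return -1;
    s->len += (size_t)n;
    return 0;
}

/* Simplified driver: start, then show/next until NULL, then stop. */
static int demo_read_all(struct fake_seq *s)
{
    long long pos = 0;
    void *v;

    assert(s != NULL);
    memset(s, 0, sizeof(*s));
    v = demo_start(s, &pos);
    while (v) {
        if (demo_show(s, v) != 0)
            break;
        v = demo_next(s, v, &pos);
    }
    demo_stop(s, v);
    return (int)pos;    /* final position reached */
}
```

The point of the sketch is the ordering: one start, alternating show and next, and exactly one stop, whether the sequence ended normally or output had to be cut short.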

Now that it has a full set of iterator operations, seq must package them up and connect them to a file in /proc. The first step is done by filling in a seq_operations structure:

static struct seq_operations ct_seq_ops = {
        .start = ct_seq_start,
        .next  = ct_seq_next,
        .stop  = ct_seq_stop,
        .show  = ct_seq_show
};

With that structure in place, we must create a file implementation that the kernel understands. We do not use the read_proc method described previously; when using seq_file, it is best to connect in to /proc at a slightly lower level. That means creating a file_operations structure (yes, the same structure used for char drivers) implementing all of the operations needed by the kernel to handle reads and seeks on the file. Fortunately, this task is straightforward. The first step is to create an open method that connects the file to the seq_file operations:

/*
 * Time to set up the file operations for our /proc file.  In this case,
 * all we need is an open function which sets up the sequence ops.
 */

static int ct_open(struct inode *inode, struct file *file)
{
        return seq_open(file, &ct_seq_ops);
}

The call to seq_open connects the file structure with our sequence operations defined above. As it turns out, open is the only file operation we must implement ourselves, so we can now set up our file_operations structure:

static struct file_operations ct_file_ops = {
        .owner   = THIS_MODULE,
        .open    = ct_open,
        .read    = seq_read,
        .llseek  = seq_lseek,
        .release = seq_release
};

The final step is to create the actual file in /proc:

entry = create_proc_entry("sequence", 0, NULL);
if (entry)
    entry->proc_fops = &ct_file_ops;

API

void *start(struct seq_file *sfile, loff_t *pos);
void *next(struct seq_file *sfile, void *v, loff_t *pos);
void stop(struct seq_file *sfile, void *v);
int show(struct seq_file *sfile, void *v);

The show method should create output for the item in the sequence indicated by the iterator v. It should not use printk, however; instead, there is a special set of functions for seq_file output:

int seq_printf(struct seq_file *sfile, const char *fmt, ...);
    This is the printf equivalent for seq_file implementations; it takes the usual format string and additional value arguments. You must also pass it the seq_file structure given to the show function, however. If seq_printf returns a nonzero value, it means that the buffer has filled, and output is being discarded. Most implementations ignore the return value, however.

int seq_putc(struct seq_file *sfile, char c);

int seq_puts(struct seq_file *sfile, const char *s);
    These are the equivalents of the user-space putc and puts functions.

int seq_escape(struct seq_file *m, const char *s, const char *esc);
    This function is equivalent to seq_puts with the exception that any character in s that is also found in esc is printed in octal format. A common value for esc is " \t\n\\", which keeps embedded white space from messing up the output and possibly confusing shell scripts.

int seq_path(struct seq_file *sfile, struct vfsmount *m, struct dentry
    *dentry, char *esc);
    This function can be used for outputting the file name associated with a given directory entry. It is unlikely to be useful in device drivers; we have included it here for completeness.
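The octal escaping described for seq_escape can be tried out in user space. The sketch below is not the kernel function: escape_octal is a hypothetical helper that writes into a caller-supplied buffer instead of a seq_file, and its return convention (0 on success, -1 when the buffer is too small) is an assumption made for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: copy s into dst, replacing every character that
 * also appears in esc with a backslash followed by three octal digits.
 * Returns 0 on success, -1 if dst cannot hold the result.
 */
static int escape_octal(char *dst, size_t size, const char *s, const char *esc)
{
    size_t used = 0;

    assert(dst != NULL && s != NULL && esc != NULL);

    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;

        if (strchr(esc, c) != NULL) {
            if (size - used < 5)        /* 4 characters plus the NUL */
                return -1;
            snprintf(dst + used, size - used, "\\%03o", (unsigned int)c);
            used += 4;
        } else {
            if (size - used < 2)        /* 1 character plus the NUL */
                return -1;
            dst[used++] = (char)c;
        }
    }
    if (used >= size)
        return -1;
    dst[used] = '\0';
    return 0;
}
```

With esc set to " \t\n\\", the input "a b" followed by a tab and "c" becomes `a\040b\011c`, so each output record stays on one line and contains no raw white space.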

Ref

<linux/seq_file.h>

struct seq_file {
        char *buf;
        size_t size;
        size_t from;
        size_t count;
        loff_t index;
        loff_t read_pos;
        u64 version;
        struct mutex lock;
        const struct seq_operations *op;
        void *private;
};

struct seq_operations {
        void * (*start) (struct seq_file *m, loff_t *pos);
        void (*stop) (struct seq_file *m, void *v);
        void * (*next) (struct seq_file *m, void *v, loff_t *pos);
        int (*show) (struct seq_file *m, void *v);
};

test

# cat /proc/sequence
0
1
2
3
4
5
6
7
8
9
10
11
12
13
.........

`silly.c`

chapter 9.4

I/O Ports and I/O Memory

API

#include <asm/io.h>
void *ioremap(unsigned long phys_addr, unsigned long size);
void *ioremap_nocache(unsigned long phys_addr, unsigned long size);
void iounmap(void *virt_addr);
   /* ioremap remaps a physical address range into the processor's
virtual address space, making it available to the kernel. iounmap
frees the mapping when it is no longer needed.*/

#include <asm/io.h>
unsigned int ioread8(void *addr);
unsigned int ioread16(void *addr);
unsigned int ioread32(void *addr);
void iowrite8(u8 value, void *addr);
void iowrite16(u16 value, void *addr);
void iowrite32(u32 value, void *addr);
    Accessor functions that are used to work with I/O memory.

void ioread8_rep(void *addr, void *buf, unsigned long count);
void ioread16_rep(void *addr, void *buf, unsigned long count);
void ioread32_rep(void *addr, void *buf, unsigned long count);
void iowrite8_rep(void *addr, const void *buf, unsigned long count);
void iowrite16_rep(void *addr, const void *buf, unsigned long count);
void iowrite32_rep(void *addr, const void *buf, unsigned long count);
    /* "Repeating" versions of the I/O memory primitives.*/

unsigned readb(address);
unsigned readw(address);
unsigned readl(address);
void writeb(unsigned value, address);
void writew(unsigned value, address);
void writel(unsigned value, address);

memset_io(address, value, count);
memcpy_fromio(dest, source, nbytes);
memcpy_toio(dest, source, nbytes);
    Older, type-unsafe functions for accessing I/O memory.

void *ioport_map(unsigned long port, unsigned int count);
void ioport_unmap(void *addr);
    A driver author who wants to treat I/O ports as if they were I/O memory may pass those ports to ioport_map. The mapping should be undone (with ioport_unmap) when no longer needed.
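The "repeating" accessors listed above differ from memcpy_fromio in one respect that is easy to miss: they access the same I/O address count times while stepping through the memory buffer, which suits a device that exposes a FIFO behind a single data register. The userspace model below only illustrates that access pattern. struct fake_fifo, fake_ioread8, and fake_ioread8_rep are hypothetical names, and the value returned for an empty FIFO is an assumption of this sketch, not device or kernel behavior.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a device data register backed by a FIFO. */
struct fake_fifo {
    const unsigned char *data;  /* bytes the "device" will deliver */
    size_t len;                 /* number of bytes available */
    size_t head;                /* next byte to deliver */
};

/* Stand-in for ioread8(): each read of the one register pops a byte. */
static unsigned int fake_ioread8(struct fake_fifo *reg)
{
    assert(reg != NULL);
    if (reg->head >= reg->len)
        return 0xff;            /* assumed value for an empty FIFO */
    return reg->data[reg->head++];
}

/*
 * Stand-in for ioread8_rep(): perform count reads of the same register,
 * storing the results in consecutive bytes of buf.
 */
static void fake_ioread8_rep(struct fake_fifo *reg, void *buf,
                             unsigned long count)
{
    unsigned char *out = buf;

    while (count--)
        *out++ = (unsigned char)fake_ioread8(reg);
}
```

Only the destination pointer advances in fake_ioread8_rep; the register argument stays fixed, which mirrors how the repeating primitives are meant to be used.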

misc-progs

dataalign.c

SRC

#include <sys/utsname.h>

struct utsname name;
uname(&name);  //=> name.machine

struct c   {char c;  char      t;} c;
struct s   {char c;  short     t;} s;
struct i   {char c;  int       t;} i;
struct l   {char c;  long      t;} l;
struct ll  {char c;  long long t;} ll;
struct p   {char c;  void *    t;} p;
struct u1b {char c;  __u8      t;} u1b;
struct u2b {char c;  __u16     t;} u2b;
struct u4b {char c;  __u32     t;} u4b;
struct u8b {char c;  __u64     t;} u8b;
(int)((void *)(&c.t)   - (void *)&c),

struct c2   {char c;       char t;} c2;
struct s2   {short c;      char t;} s2;
struct i2   {int c;        char t;} i2;
struct l2   {long c;       char t;} l2;
struct ll2  {long long c;  char t;} ll2;
struct p2   {void * c;     char t;} p2;
struct u1b2 {__u8 c;       char t;} u1b2;
struct u2b2 {__u16 c;      char t;} u2b2;
struct u4b2 {__u32 c;      char t;} u4b2;
struct u8b2 {__u64 c;      char t;} u8b2;
(int)((void *)(&c2.t)   - (void *)&c2)
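The excerpt measures member offsets by subtracting void pointers, which relies on a GNU C extension. A portable way to take the same measurement is the standard offsetof macro. The sketch below assumes that substitution; the struct names and format_offsets are hypothetical and cover only a few of the types in the original. With a leading char, the offset of t is the alignment the compiler gives its type; with a trailing char, the offset of t equals the size of the leading member on common ABIs, since a char needs no padding in front of it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Leading char: offsetof(t) is the alignment of t's type. */
struct lead_short { char c; short t; };
struct lead_int   { char c; int t; };
struct lead_ll    { char c; long long t; };

/* Trailing char: offsetof(t) is the size of the leading member. */
struct trail_short { short c; char t; };
struct trail_int   { int c; char t; };
struct trail_ll    { long long c; char t; };

/*
 * Format the short/int/long long offsets for both layouts into buf.
 * Returns the snprintf result (characters that were or would be written).
 */
static int format_offsets(char *buf, size_t size)
{
    assert(buf != NULL);
    return snprintf(buf, size,
                    "lead: %zu %zu %zu  trail: %zu %zu %zu",
                    offsetof(struct lead_short, t),
                    offsetof(struct lead_int, t),
                    offsetof(struct lead_ll, t),
                    offsetof(struct trail_short, t),
                    offsetof(struct trail_int, t),
                    offsetof(struct trail_ll, t));
}
```

The numbers depend on the target ABI, so no fixed output is shown here. The two i686 tables in the Test section below appear to correspond to these two layouts, with long long reported as 4 in the first and 8 in the second.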

Test

arch  Align:  char  short  int  long   ptr long-long  u8 u16 u32 u64
i686            1     2     4     4     4     4        1   2   4   4

arch  Align:  char  short  int  long   ptr long-long  u8 u16 u32 u64
i686            1     2     4     4     4     8        1   2   4   8

datasize.c

(int)sizeof(char), (int)sizeof(short), (int)sizeof(int),
           (int)sizeof(long),
           (int)sizeof(void *), (int)sizeof(long long), (int)sizeof(__u8),
           (int)sizeof(__u16), (int)sizeof(__u32), (int)sizeof(__u64));

arch   Size:  char  short  int  long   ptr long-long  u8 u16 u32 u64
i686            1     2     4     4     4     8        1   2   4   8

asynctest.c

Asynchronous Notification

chapter 6.4

SRC

struct sigaction action;

memset(&action, 0, sizeof(action));
action.sa_handler = sighandler;
action.sa_flags = 0;

sigaction(SIGIO, &action, NULL);

fcntl(STDIN_FILENO, F_SETOWN, getpid());
fcntl(STDIN_FILENO, F_SETFL, fcntl(STDIN_FILENO, F_GETFL) | FASYNC);

void sighandler(int signo){}

Overview

Ref

struct sigaction

/usr/include/signal.h
# include <bits/sigaction.h>

/* Structure describing the action to be taken when a signal arrives.  */
struct sigaction
  {
    /* Signal handler.  */
#ifdef __USE_POSIX199309
    union
      {
        /* Used if SA_SIGINFO is not set.  */
        __sighandler_t sa_handler;
        /* Used if SA_SIGINFO is set.  */
        void (*sa_sigaction) (int, siginfo_t *, void *);
      }
    __sigaction_handler;
# define sa_handler     __sigaction_handler.sa_handler
# define sa_sigaction   __sigaction_handler.sa_sigaction
#else
    __sighandler_t sa_handler;
#endif

    /* Additional set of signals to be blocked.  */
    __sigset_t sa_mask;

    /* Special flags.  */
    int sa_flags;

    /* Restore handler.  */
    void (*sa_restorer) (void);
  };

/* Bits in `sa_flags'.  */
#define SA_NOCLDSTOP  1          /* Don't send SIGCHLD when children stop.  */
#define SA_NOCLDWAIT  2          /* Don't create zombie on child death.  */
#define SA_SIGINFO    4          /* Invoke signal-catching function with
                                    three arguments instead of one.  */

sigaction()

/* Get and/or set the action for signal SIG.  */
extern int sigaction (int __sig, __const struct sigaction *__restrict __act,
                      struct sigaction *__restrict __oact) __THROW;

fcntl

<fcntl.h>
/* Do the file control operation described by CMD on FD.
   The remaining arguments are interpreted depending on CMD.

   This function is a cancellation point and therefore not marked with
   __THROW.  */
extern int fcntl (int __fd, int __cmd, ...);

#include <bits/fcntl.h>

/* Values for the second argument to `fcntl'.  */
#define F_DUPFD         0       /* Duplicate file descriptor.  */
#define F_GETFD         1       /* Get file descriptor flags.  */
#define F_SETFD         2       /* Set file descriptor flags.  */
#define F_GETFL         3       /* Get file status flags.  */
#define F_SETFL         4       /* Set file status flags.  */
#if __WORDSIZE == 64
# define F_GETLK        5       /* Get record locking info.  */
# define F_SETLK        6       /* Set record locking info (non-blocking).  */
# define F_SETLKW       7       /* Set record locking info (blocking).  */
/* Not necessary, we always have 64-bit offsets.  */
# define F_GETLK64      5       /* Get record locking info.  */
# define F_SETLK64      6       /* Set record locking info (non-blocking).  */
# define F_SETLKW64     7       /* Set record locking info (blocking).  */
#else
# ifndef __USE_FILE_OFFSET64
#  define F_GETLK       5       /* Get record locking info.  */
#  define F_SETLK       6       /* Set record locking info (non-blocking).  */
#  define F_SETLKW      7       /* Set record locking info (blocking).  */
# else
#  define F_GETLK       F_GETLK64  /* Get record locking info.  */
#  define F_SETLK       F_SETLK64  /* Set record locking info (non-blocking).*/
#  define F_SETLKW      F_SETLKW64 /* Set record locking info (blocking).  */
# endif
# define F_GETLK64      12      /* Get record locking info.  */
# define F_SETLK64      13      /* Set record locking info (non-blocking).  */
# define F_SETLKW64     14      /* Set record locking info (blocking).  */
#endif

#if defined __USE_BSD || defined __USE_UNIX98
# define F_SETOWN       8       /* Set owner (process receiving SIGIO).  */
# define F_GETOWN       9       /* Get owner (process receiving SIGIO).  */
#endif

#ifdef __USE_GNU
# define F_SETSIG       10      /* Set number of signal to be sent.  */
# define F_GETSIG       11      /* Get number of signal to be sent.  */
# define F_SETOWN_EX    15      /* Set owner (thread receiving SIGIO).  */
# define F_GETOWN_EX    16      /* Get owner (thread receiving SIGIO).  */
#endif

#ifdef __USE_GNU
# define F_SETLEASE     1024    /* Set a lease.  */
# define F_GETLEASE     1025    /* Enquire what lease is active.  */
# define F_NOTIFY       1026    /* Request notifications on a directory.  */
# define F_DUPFD_CLOEXEC 1030   /* Duplicate file descriptor with
                                   close-on-exec set.  */
#endif

gdbline

inp.c

load50.c

chapter 7.3

This program forks a number of processes that do nothing, but do it in a CPU-intensive way. The program is part of the sample files accompanying this book, and forks 50 processes by default, although the number can be specified on the command line.
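A load generator of this kind can be sketched in a few lines. This is not the load50.c source: run_load and burn_cpu are hypothetical names, and the spin is bounded by a time limit, an assumption made here so that the example terminates on its own and the parent can reap its children.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Spin on the CPU until roughly `seconds` of wall-clock time have passed. */
static void burn_cpu(int seconds)
{
    time_t end = time(NULL) + seconds;
    volatile unsigned long sink = 0;    /* keeps the loop from being elided */

    while (time(NULL) < end)
        sink++;
}

/*
 * Fork nproc busy children and wait for all of them.
 * Returns the number of children that were reaped.
 */
static int run_load(int nproc, int seconds)
{
    int i, started = 0, reaped = 0;

    assert(nproc >= 0 && seconds >= 0);

    for (i = 0; i < nproc; i++) {
        pid_t pid = fork();

        if (pid < 0)
            break;              /* cannot fork more; wait for what exists */
        if (pid == 0) {
            burn_cpu(seconds);  /* child: do nothing, CPU-intensively */
            _exit(0);
        }
        started++;
    }
    while (started-- > 0)
        if (wait(NULL) > 0)
            reaped++;
    return reaped;
}
```

Running run_load(50, 10) in one terminal while reading /proc/jitbusy or /proc/jitsched in another gives a rough way to compare how the delay loops behave on a loaded system.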
